Preface


This  book  is  meant  to  be  a  textbook  for  a  standard  one-semester  introductory 
statistics  course  for  general  education  students.  Our  motivation  for  writing  it  is 
twofold: (1) to provide a low-cost alternative to many existing popular textbooks on
the market; and (2) to provide a quality textbook on the subject with a focus on the
core material of the course in a balanced presentation.


The  high  cost  of  textbooks  has  spiraled  out  of  control  in  recent  years.  The  high 
frequency  at  which  new  editions  of  popular  texts  appear  puts  a  tremendous  burden 
on  students  and  faculty  alike,  as  well  as  the  natural  environment.  Against  this 
background  we  set  out  to  write  a  quality  textbook  with  materials  such  as  examples 
and  exercises  that  age  well  with  time  and  that  would  therefore  not  require  frequent 
new  editions.  Our  vision  resonates  well  with  the  publisher’s  business  model  which 
includes  free  digital  access,  reduced  paper  prints,  and  easy  customization  by 
instructors  if  additional  material  is  desired. 


Over  time  the  core  content  of  this  course  has  developed  into  a  well-defined  body  of 
material  that  is  substantial  for  a  one-semester  course.  The  authors  believe  that  the 
students  in  this  course  are  best  served  by  a  focus  on  the  core  material  and  not  by  an 
exposure  to  a  plethora  of  peripheral  topics.  Therefore  in  writing  this  book  we  have 
sought  to  present  material  that  comprises  fully  a  central  body  of  knowledge  that  is 
defined  according  to  convention,  realistic  expectation  with  respect  to  course 
duration  and  students’  maturity  level,  and  our  professional  judgment  and 
experience. We believe that certain topics, among them Poisson and geometric
distributions and the normal approximation to the binomial distribution
(particularly with a continuity correction), are distracting in nature. Other topics,
such  as  nonparametric  methods,  while  important,  do  not  belong  in  a  first  course  in 
statistics.  As  a  result  we  envision  a  smaller  and  less  intimidating  textbook  that 
trades  some  extended  and  unnecessary  topics  for  a  better  focused  presentation  of 
the  central  material. 


Textbooks  for  this  course  cover  a  wide  range  in  terms  of  simplicity  and  complexity. 
Some  popular  textbooks  emphasize  the  simplicity  of  individual  concepts  to  the 
point  of  lacking  the  coherence  of  an  overall  network  of  concepts.  Other  textbooks 
include  overly  detailed  conceptual  and  computational  discussions  and  as  a  result 
repel  students  from  reading  them.  The  authors  believe  that  a  successful  book  must 
strike  a  balance  between  the  two  extremes,  however  difficult  it  may  be.  As  a 
consequence  the  overarching  guiding  principle  of  our  writing  is  to  seek  simplicity 
but  to  preserve  the  coherence  of  the  whole  body  of  information  communicated, 
both conceptually and computationally. We seek to remind ourselves (and others)
that  we  teach  ideas,  not  just  step-by-step  algorithms,  but  ideas  that  can  be 
implemented  by  straightforward  algorithms. 


In  our  experience  most  students  come  to  an  introductory  course  in  statistics  with  a 
calculator  that  they  are  familiar  with  and  with  which  their  proficiency  is  more  than 
adequate for the course material. If the instructor chooses to use technological aids,
either calculators or statistical software such as Minitab or SPSS, not merely for
arithmetical computations but as a significant component of the course, then
effective instruction for their use will require more extensive written instruction
than a mere paragraph or two in the text. Given the plethora of such aids available,
to  discuss  a  few  of  them  would  not  provide  sufficiently  wide  or  detailed  coverage 
and  to  discuss  many  would  digress  unnecessarily  from  the  conceptual  focus  of  the 
book.  The  overarching  philosophy  of  this  textbook  is  to  present  the  core  material  of 
an  introductory  course  in  statistics  for  non-majors  in  a  complete  yet  streamlined 
way.  Much  room  has  been  intentionally  left  for  instructors  to  apply  their  own 
instructional  styles  as  they  deem  appropriate  for  their  classes  and  educational 
goals.  We  believe  that  the  whole  matter  of  what  technological  aids  to  use,  and  to 
what  extent,  is  precisely  the  type  of  material  best  left  to  the  instructor’s  discretion. 


All figures with the exception of Figure 1.1 "The Grand Picture of Statistics", Figure
2.1 "Stem and Leaf Diagram", Figure 2.2 "Ordered Stem and Leaf Diagram", Figure
2.13 "The Box Plot", Figure 10.4 "Linear Correlation Coefficient", Figure 10.5 "The
Simple Linear Model Concept", and the unnumbered figure in Note 2.50 "Example
16" of Chapter 2 "Descriptive Statistics" were generated using MATLAB, copyright
2010.




Chapter  1 
Introduction 


In  this  chapter  we  will  introduce  some  basic  terminology  and  lay  the  groundwork 
for  the  course.  We  will  explain  in  general  terms  what  statistics  and  probability  are 
and  the  problems  that  these  two  areas  of  study  are  designed  to  solve. 

1.1 Basic Definitions and Concepts


LEARNING  OBJECTIVE 

1.  To  learn  the  basic  definitions  used  in  statistics  and  some  of  its  key 
concepts. 


We  begin  with  a  simple  example.  There  are  millions  of  passenger  automobiles  in  the 
United  States.  What  is  their  average  value?  It  is  obviously  impractical  to  attempt  to 
solve  this  problem  directly  by  assessing  the  value  of  every  single  car  in  the  country, 
adding  up  all  those  numbers,  and  then  dividing  by  however  many  numbers  there 
are.  Instead,  the  best  we  can  do  would  be  to  estimate  the  average.  One  natural  way 
to  do  so  would  be  to  randomly  select  some  of  the  cars,  say  200  of  them,  ascertain  the 
value  of  each  of  those  cars,  and  find  the  average  of  those  200  numbers.  The  set  of  all 
those  millions  of  vehicles  is  called  the  population  of  interest,  and  the  number 
attached to each one, its value, is a measurement. The average value is a parameter: a
number  that  describes  a  characteristic  of  the  population,  in  this  case  monetary 
worth.  The  set  of  200  cars  selected  from  the  population  is  called  a  sample,  and  the 
200  numbers,  the  monetary  values  of  the  cars  we  selected,  are  the  sample  data.  The 
average  of  the  data  is  called  a  statistic:  a  number  calculated  from  the  sample  data. 
This  example  illustrates  the  meaning  of  the  following  definitions. 


1. Population: all objects of interest.

2. Sample: the objects examined.

3. Measurement: a number or attribute computed for each member of a set of objects.

4. Sample data: the measurements from a sample.

Definition

A  parameter5  is  a  number  that  summarizes  some  aspect  of  the  population  as  a  whole. 
A  statistic6  is  a  number  computed  from  the  sample  data. 


Continuing  with  our  example,  if  the  average  value  of  the  cars  in  our  sample  was 
$8,357,  then  it  seems  reasonable  to  conclude  that  the  average  value  of  all  cars  is 
about  $8,357.  In  reasoning  this  way  we  have  drawn  an  inference  about  the 
population  based  on  information  obtained  from  the  sample.  In  general,  statistics  is  a 
study  of  data:  describing  properties  of  the  data,  which  is  called  descriptive  statistics, 
and  drawing  conclusions  about  a  population  of  interest  from  information  extracted 
from  a  sample,  which  is  called  inferential  statistics.  Computing  the  single  number 
$8,357  to  summarize  the  data  was  an  operation  of  descriptive  statistics;  using  it  to 
make  a  statement  about  the  population  was  an  operation  of  inferential  statistics. 
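
The automobile example can be mimicked in a few lines of Python. The sketch below is only an illustration: the population of car values is simulated (the range of values and the population size are assumptions, not real data), and the names `population`, `sample`, `population_mean`, and `sample_mean` are ours. It computes a parameter from the whole population, which is rarely possible in practice, and a statistic from a random sample of 200.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: one simulated dollar value per car.
population = [random.uniform(500, 20000) for _ in range(100_000)]

# Parameter: describes the whole population (usually unknown in practice).
population_mean = statistics.mean(population)

# Sample: 200 cars chosen at random, without replacement.
sample = random.sample(population, 200)

# Statistic: computed from the sample data only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:,.0f}")
print(f"statistic (sample mean):     {sample_mean:,.0f}")
```

Computing `sample_mean` is descriptive statistics; using it as an estimate of `population_mean` is inferential statistics.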


Definition 

Statistics7  is  a  collection  of  methods  for  collecting,  displaying,  analyzing,  and  drawing 
conclusions  from  data. 


Definition 


Descriptive  statistics8  is  the  branch  of  statistics  that  involves  organizing,  displaying, 
and  describing  data. 

5.  A  number  that  summarizes 
some  aspect  of  the  population. 

6.  A  number  computed  from  the 
sample  data. 

7.  Collection,  display,  analysis, 
and  inference  from  data. 

8.  The  organization,  display,  and 
description  of  data. 

Definition

Inferential  statistics9  is  the  branch  of  statistics  that  involves  drawing  conclusions 
about  a  population  based  on  information  contained  in  a  sample  taken  from  that 
population. 


The  measurement  made  on  each  element  of  a  sample  need  not  be  numerical.  In  the 
case  of  automobiles,  what  is  noted  about  each  car  could  be  its  color,  its  make,  its 
body  type,  and  so  on.  Such  data  are  categorical  or  qualitative,  as  opposed  to  numerical 
or  quantitative  data  such  as  value  or  age.  This  is  a  general  distinction. 


Definition 


Qualitative  data10  are  measurements  for  which  there  is  no  natural  numerical  scale, 
but  which  consist  of  attributes,  labels,  or  other  nonnumerical  characteristics. 


Definition 


Quantitative  data11  are  numerical  measurements  that  arise  from  a  natural 
numerical  scale. 


9.  Drawing  conclusions  about  a 
population  based  on  a  sample. 

10.  Measurements  for  which  there 
is  no  natural  numerical  scale. 


Qualitative  data  can  generate  numerical  sample  statistics.  In  the  automobile 
example,  for  instance,  we  might  be  interested  in  the  proportion  of  all  cars  that  are 
less  than  six  years  old.  In  our  same  sample  of  200  cars  we  could  note  for  each  car 
whether  it  is  less  than  six  years  old  or  not,  which  is  a  qualitative  measurement.  If 
172  cars  in  the  sample  are  less  than  six  years  old,  which  is  0.86  or  86%,  then  we 
would  estimate  the  parameter  of  interest,  the  population  proportion,  to  be  about 
the  same  as  the  sample  statistic,  the  sample  proportion,  that  is,  about  0.86. 
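
The arithmetic behind the sample proportion can be written out as a short Python sketch. The list `is_under_six` is a hypothetical encoding of the qualitative measurement (one yes/no answer per sampled car), built to reproduce the counts used above.

```python
# Qualitative measurement on each sampled car: is it less than six years old?
# 172 "yes" and 28 "no" answers reproduce the counts used in the text.
is_under_six = [True] * 172 + [False] * 28

sample_size = len(is_under_six)
sample_proportion = sum(is_under_six) / sample_size  # each True counts as 1

print(sample_size, sample_proportion)  # 200 0.86
```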


11.  Numerical  measurements  that 
arise  from  a  natural  numerical 
scale. 


The  relationship  between  a  population  of  interest  and  a  sample  drawn  from  that 
population  is  perhaps  the  most  important  concept  in  statistics,  since  everything 
else rests on it. This relationship is illustrated graphically in Figure 1.1 "The Grand
Picture  of  Statistics".  The  circles  in  the  large  box  represent  elements  of  the 
population.  In  the  figure  there  was  room  for  only  a  small  number  of  them  but  in 
actual  situations,  like  our  automobile  example,  they  could  very  well  number  in  the 
millions.  The  solid  black  circles  represent  the  elements  of  the  population  that  are 
selected  at  random  and  that  together  form  the  sample.  For  each  element  of  the 
sample there is a measurement of interest, denoted by a lower case x (which we
have indexed as $x_1, \ldots, x_n$ to tell them apart); these measurements collectively
form the sample data set. From the data we may calculate various statistics. To
anticipate the notation that will be used later, we might compute the sample mean
$\bar{x}$ and the sample proportion $\hat{p}$, and take them as approximations to the population
mean $\mu$ (this is the lower case Greek letter mu, the traditional symbol for this
parameter) and the population proportion $p$, respectively. The other symbols in the
figure  stand  for  other  parameters  and  statistics  that  we  will  encounter. 


Figure  1.1  The  Grand  Picture  of  Statistics 

KEY TAKEAWAYS


•  Statistics  is  a  study  of  data:  describing  properties  of  data  (descriptive 
statistics)  and  drawing  conclusions  about  a  population  based  on 
information  in  a  sample  (inferential  statistics). 

•  The  distinction  between  a  population  together  with  its  parameters  and  a 
sample  together  with  its  statistics  is  a  fundamental  concept  in 
inferential  statistics. 

•  Information  in  a  sample  is  used  to  make  inferences  about  the  population 
from  which  the  sample  was  drawn. 


EXERCISES


1.  Explain  what  is  meant  by  the  term  population. 

2.  Explain  what  is  meant  by  the  term  sample. 

3.  Explain  how  a  sample  differs  from  a  population. 

4.  Explain  what  is  meant  by  the  term  sample  data. 

5.  Explain  what  a  parameter  is. 

6.  Explain  what  a  statistic  is. 

7.  Give  an  example  of  a  population  and  two  different  characteristics  that  may  be 
of  interest. 

8.  Describe  the  difference  between  descriptive  statistics  and  inferential  statistics. 
Illustrate  with  an  example. 

9.  Identify  each  of  the  following  data  sets  as  either  a  population  or  a  sample: 

a.  The  grade  point  averages  (GPAs)  of  all  students  at  a  college. 

b.  The  GPAs  of  a  randomly  selected  group  of  students  on  a  college  campus. 

c.  The  ages  of  the  nine  Supreme  Court  Justices  of  the  United  States  on 
January  1, 1842. 

d.  The  gender  of  every  second  customer  who  enters  a  movie  theater. 

e.  The  lengths  of  Atlantic  croakers  caught  on  a  fishing  trip  to  the  beach. 

10.  Identify  the  following  measures  as  either  quantitative  or  qualitative: 

a.  The  30  high-temperature  readings  of  the  last  30  days. 

b.  The  scores  of  40  students  on  an  English  test. 

c.  The  blood  types  of  120  teachers  in  a  middle  school. 

d.  The  last  four  digits  of  social  security  numbers  of  all  students  in  a  class. 

e.  The  numbers  on  the  jerseys  of  53  football  players  on  a  team. 

11.  Identify  the  following  measures  as  either  quantitative  or  qualitative: 

a.  The  genders  of  the  first  40  newborns  in  a  hospital  one  year. 

b.  The  natural  hair  color  of  20  randomly  selected  fashion  models. 

c.  The  ages  of  20  randomly  selected  fashion  models. 

d.  The  fuel  economy  in  miles  per  gallon  of  20  new  cars  purchased  last  month. 

e.  The  political  affiliation  of  500  randomly  selected  voters. 

12.  A  researcher  wishes  to  estimate  the  average  amount  spent  per  person  by 
visitors  to  a  theme  park.  He  takes  a  random  sample  of  forty  visitors  and 
obtains  an  average  of  $28  per  person. 

a. What is the population of interest?

b.  What  is  the  parameter  of  interest? 

c.  Based  on  this  sample,  do  we  know  the  average  amount  spent  per  person  by 
visitors  to  the  park?  Explain  fully. 

13.  A  researcher  wishes  to  estimate  the  average  weight  of  newborns  in  South 
America  in  the  last  five  years.  He  takes  a  random  sample  of  235  newborns  and 
obtains  an  average  of  3.27  kilograms. 

a.  What  is  the  population  of  interest? 

b.  What  is  the  parameter  of  interest? 

c.  Based  on  this  sample,  do  we  know  the  average  weight  of  newborns  in 
South  America?  Explain  fully. 

14. A researcher wishes to estimate the proportion of all adults who own a cell
phone. He takes a random sample of 1,572 adults; 1,298 of them own a cell
phone, hence $1298/1572 \approx 0.83$ or about 83% own a cell phone.

a.  What  is  the  population  of  interest? 

b.  What  is  the  parameter  of  interest? 

c.  What  is  the  statistic  involved? 

d.  Based  on  this  sample,  do  we  know  the  proportion  of  all  adults  who  own  a 
cell  phone?  Explain  fully. 

15. A sociologist wishes to estimate the proportion of all adults in a certain region
who have never married. In a random sample of 1,320 adults, 145 have never
married, hence $145/1320 \approx 0.11$ or about 11% have never married.

a. What is the population of interest?

b. What is the parameter of interest?

c. What is the statistic involved?

d. Based on this sample, do we know the proportion of all adults who have
never married? Explain fully.

16.

a. What must be true of a sample if it is to give a reliable estimate of the value
of a particular population parameter?

b. What must be true of a sample if it is to give certain knowledge of the value
of a particular population parameter?


ANSWERS


1.  A  population  is  the  total  collection  of  objects  that  are  of  interest  in  a  statistical 
study. 

3. A sample, being a subset, is typically smaller than the population. In a
statistical study, all elements of a sample are available for observation, which
is not typically the case for a population.

5.  A  parameter  is  a  value  describing  a  characteristic  of  a  population.  In  a 
statistical  study  the  value  of  a  parameter  is  typically  unknown. 

7.  All  currently  registered  students  at  a  particular  college  form  a  population.  Two 
population  characteristics  of  interest  could  be  the  average  GPA  and  the 
proportion  of  students  over  23  years. 


9.

a. Population.

b. Sample.

c. Population.

d. Sample.

e. Sample.

11.

a. Qualitative.

b. Qualitative.

c. Quantitative.

d. Quantitative.

e. Qualitative.

13.

a. All newborn babies in South America in the last five years.

b. The average birth weight of all newborn babies in South America in the
last five years.

c. No, not exactly, but we know the approximate value of the average.

15.

a. All adults in the region.

b. The proportion of the adults in the region who have never married.

c. The proportion computed from the sample, 0.11.

d. No, not exactly, but we know the approximate value of the proportion.


1.2 Overview


LEARNING  OBJECTIVE 

1.  To  obtain  an  overview  of  the  material  in  the  text. 


The  example  we  have  given  in  the  first  section  seems  fairly  simple,  but  there  are 
some  significant  problems  that  it  illustrates.  We  have  supposed  that  the  200  cars  of 
the  sample  had  an  average  value  of  $8,357  (a  number  that  is  precisely  known),  and 
concluded  that  the  population  has  an  average  of  about  the  same  amount,  although 
its  precise  value  is  still  unknown.  What  would  happen  if  someone  were  to  take 
another  sample  of  exactly  the  same  size  from  exactly  the  same  population?  Would 
he  get  the  same  sample  average  as  we  did,  $8,357?  Almost  surely  not.  In  fact,  if  the 
investigator  who  took  the  second  sample  were  to  report  precisely  the  same  value, 
we  would  immediately  become  suspicious  of  his  result.  The  sample  average  is  an 
example  of  what  is  called  a  random  variable:  a  number  that  varies  from  trial  to  trial 
of  an  experiment  (in  this  case,  from  sample  to  sample),  and  does  so  in  a  way  that 
cannot  be  predicted  precisely.  Random  variables  will  be  a  central  object  of  study  for 
us,  beginning  in  Chapter  4  "Discrete  Random  Variables". 
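
The way a sample average changes from sample to sample can be seen in a small simulation. As before, the population below is simulated rather than real, and the helper `sample_mean` is a name introduced only for this sketch. Each call draws a fresh random sample of 200 from the same population.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population of car values (simulated, not real data).
population = [random.uniform(500, 20000) for _ in range(100_000)]

def sample_mean(n):
    """Mean of one random sample of size n drawn from the population."""
    return statistics.mean(random.sample(population, n))

# Five samples of the same size from the same population.
means = [sample_mean(200) for _ in range(5)]
for m in means:
    print(round(m))
```

The five printed averages differ from one another even though nothing about the population or the sample size has changed; that is the sense in which the sample average is a random variable.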


Another  issue  that  arises  is  that  different  samples  have  different  levels  of  reliability. 
We have supposed that our sample of size 200 had an average of $8,357. If a sample
of  size  1,000  yielded  an  average  value  of  $7,832,  then  we  would  naturally  regard  this 
latter  number  as  likely  to  be  a  better  estimate  of  the  average  value  of  all  cars.  How 
can  this  be  expressed?  An  important  idea  that  we  will  develop  in  Chapter  7 
"Estimation"  is  that  of  the  confidence  interval:  from  the  data  we  will  construct  an 
interval  of  values  so  that  the  process  has  a  certain  chance,  say  a  95%  chance,  of 
generating  an  interval  that  contains  the  actual  population  average.  Thus  instead  of 
reporting  a  single  estimate,  $8,357,  for  the  population  mean,  we  would  say  that  we 
are  95%  certain  that  the  true  average  is  within  $100  of  our  sample  mean,  that  is, 
between  $8,257  and  $8,457,  the  number  $100  having  been  computed  from  the 
sample  data  just  like  the  sample  mean  $8,357  was.  This  will  automatically  indicate 
the  reliability  of  the  sample,  since  to  obtain  the  same  chance  of  containing  the 
unknown  parameter  a  large  sample  will  typically  produce  a  shorter  interval  than  a 
small  one  will.  But  unless  we  perform  a  census,  we  can  never  be  completely  sure  of 
the  true  average  value  of  the  population;  the  best  that  we  can  do  is  to  make 
statements of probability, an important concept that we will begin to study formally
in  Chapter  3  "Basic  Concepts  of  Probability". 
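
To make the idea of an interval estimate concrete, here is a rough sketch in Python. It assumes the usual large-sample recipe, sample mean plus or minus 1.96 times the sample standard deviation divided by the square root of the sample size; that formula is not derived in this chapter and is used here only as an assumption. The population is simulated, and `confidence_interval_95` is a name chosen for this illustration.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population of car values (simulated, not real data).
population = [random.uniform(500, 20000) for _ in range(100_000)]

def confidence_interval_95(sample):
    """Large-sample 95% interval: mean +/- 1.96 * s / sqrt(n) (assumed formula)."""
    n = len(sample)
    center = statistics.mean(sample)
    margin = 1.96 * statistics.stdev(sample) / math.sqrt(n)
    return center - margin, center + margin

for n in (200, 1000):
    low, high = confidence_interval_95(random.sample(population, n))
    print(f"n = {n}: ({low:,.0f}, {high:,.0f}), width {high - low:,.0f}")
```

The interval built from 1,000 measurements should come out noticeably shorter than the one built from 200, which is how the interval reports the greater reliability of the larger sample.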


Sampling may be done not only to estimate a population parameter, but to test a
claim  that  is  made  about  that  parameter.  Suppose  a  food  package  asserts  that  the 
amount  of  sugar  in  one  serving  of  the  product  is  14  grams.  A  consumer  group  might 
suspect  that  it  is  more.  How  would  they  test  the  competing  claims  about  the 
amount  of  sugar,  14  grams  versus  more  than  14  grams?  They  might  take  a  random 
sample  of  perhaps  20  food  packages,  measure  the  amount  of  sugar  in  one  serving  of 
each  one,  and  average  those  amounts.  They  are  not  interested  in  the  true  amount  of 
sugar  in  one  serving  in  itself;  their  interest  is  simply  whether  the  claim  about  the 
true  amount  is  accurate.  Stated  another  way,  they  are  sampling  not  in  order  to 
estimate  the  average  amount  of  sugar  in  one  serving,  but  to  see  whether  that 
amount,  whatever  it  may  be,  is  larger  than  14  grams.  Again  because  one  can  have 
certain  knowledge  only  by  taking  a  census,  ideas  of  probability  enter  into  the 
analysis.  We  will  examine  tests  of  hypotheses  beginning  in  Chapter  8  "Testing 
Hypotheses". 


Several  times  in  this  introduction  we  have  used  the  term  “random  sample.” 
Generally  the  value  of  our  data  is  only  as  good  as  the  sample  that  produced  it.  For 
example,  suppose  we  wish  to  estimate  the  proportion  of  all  students  at  a  large 
university who are females, which we denote by $p$. If we select 50 students at
random and 27 of them are female, then a natural estimate is
$p \approx \hat{p} = 27/50 = 0.54$ or 54%. How much confidence we can place in this
estimate depends not only on the size of the sample, but on its quality, whether or
not it is truly random, or at least truly representative of the whole population. If all
50 students in our sample were drawn from a College of Nursing, then the
proportion of female students in the sample is likely higher than that of the entire
campus. If all 50 students were selected from a College of Engineering Sciences,
then the proportion of students in the entire student body who are females could be
underestimated. In either case, the estimate would be distorted or biased. In
statistical  practice  an  unbiased  sampling  scheme  is  important  but  in  most  cases  not 
easy  to  produce.  For  this  introductory  course  we  will  assume  that  all  samples  are 
either  random  or  at  least  representative. 
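
The effect of a biased sampling scheme can be illustrated with a simulated campus. Everything in the sketch below is hypothetical: the college names, enrollment figures, and shares of female students are invented, and `proportion_female` is a helper written only for this example. It compares a truly random sample of 50 with samples of 50 drawn from a single college.

```python
import random

random.seed(11)  # fixed seed so the sketch is reproducible

# Hypothetical campus: college -> (enrollment, share of female students).
colleges = {
    "Nursing": (2000, 0.85),
    "Engineering": (3000, 0.20),
    "Other": (15000, 0.55),
}
# One (college, is_female) record per simulated student.
students = [
    (college, random.random() < share)
    for college, (size, share) in colleges.items()
    for _ in range(size)
]

def proportion_female(group):
    """Proportion of records in the group marked as female."""
    return sum(is_female for _, is_female in group) / len(group)

true_p = proportion_female(students)
random_sample = random.sample(students, 50)
nursing_only = random.sample([s for s in students if s[0] == "Nursing"], 50)
engineering_only = random.sample([s for s in students if s[0] == "Engineering"], 50)

print(f"population proportion p: {true_p:.2f}")
print(f"random sample:           {proportion_female(random_sample):.2f}")
print(f"nursing only:            {proportion_female(nursing_only):.2f}")
print(f"engineering only:        {proportion_female(engineering_only):.2f}")
```

Under these invented figures the single-college samples land well above and well below the population proportion, while the random sample is typically much closer to it.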


KEY  TAKEAWAY 


• Statistics computed from samples vary randomly from sample to
sample. Conclusions made about population parameters are statements
of probability.


1.3 Presentation of Data


LEARNING  OBJECTIVE 

1.  To  learn  two  ways  that  data  will  be  presented  in  the  text. 


In  this  book  we  will  use  two  formats  for  presenting  data  sets.  The  first  is  a  data 
list12,  which  is  an  explicit  listing  of  all  the  individual  measurements,  either  as  a 
display  with  space  between  the  individual  measurements,  or  in  set  notation  with 
individual  measurements  separated  by  commas. 


EXAMPLE  1 


The  data  obtained  by  measuring  the  age  of  21  randomly  selected  students 
enrolled  in  freshman  courses  at  a  university  could  be  presented  as  the  data 
list 


18  18  19  19  19  18  22  20  18  18  17
19  18  24  18  20  18  21  20  17  19


or  in  set  notation  as 


{18,18,19,19,19,18,22,20,18,18,17,19,18,24,18,20,18,21,20,17,19} 


A  data  set  can  also  be  presented  by  means  of  a  data  frequency  table13,  a  table  in 
which each distinct value x is listed in the first row and its frequency14 f, which is
the  number  of  times  the  value  x  appears  in  the  data  set,  is  listed  below  it  in  the 
second  row. 


12.  An  explicit  listing  of  all  the 
individual  measurements  made 
on  a  sample. 

13.  A  table  listing  each  distinct 
value  x  and  its  frequency  f. 

14.  How  often  a  value  x  appears  in 
a  data  set. 

EXAMPLE 2


The  data  set  of  the  previous  example  is  represented  by  the  data  frequency 
table 


x   17  18  19  20  21  22  24
f    2   8   5   3   1   1   1

The  data  frequency  table  is  especially  convenient  when  data  sets  are  large  and  the 
number  of  distinct  values  is  not  too  large. 
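
Turning a data list into a data frequency table is a mechanical counting task. The following sketch uses Python's `collections.Counter` on the ages from Example 1; the variable names are ours and the choice of tool is only one possibility.

```python
from collections import Counter

# Data list from Example 1 (ages of 21 students).
ages = [18, 18, 19, 19, 19, 18, 22, 20, 18, 18, 17,
        19, 18, 24, 18, 20, 18, 21, 20, 17, 19]

frequency = Counter(ages)    # maps each distinct value x to its frequency f
values = sorted(frequency)   # distinct values in increasing order

print("x", *values)                          # x 17 18 19 20 21 22 24
print("f", *(frequency[x] for x in values))  # f 2 8 5 3 1 1 1

# Going the other way: expand the table back into a sorted data list.
expanded = sorted(frequency.elements())
```

The frequencies add up to 21, the number of measurements, and `expanded` contains the same values as the original list in sorted order, so no information is lost in the table.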


KEY  TAKEAWAY 


•  Data  sets  can  be  presented  either  by  listing  all  the  elements  or  by  giving 
a  table  of  values  and  frequencies. 


EXERCISES


1.  List  all  the  measurements  for  the  data  set  represented  by  the  following  data 
frequency  table. 


x   31  32  33  34  35
f    1   5   6   4   2


2.  List  all  the  measurements  for  the  data  set  represented  by  the  following  data 
frequency  table. 


x   97  98  99  100  101  102  103  105
f    7   5   3    4    2    2    1    1


3.  Construct  the  data  frequency  table  for  the  following  data  set. 


22  25  22  27  24  23 
26  24  22  24  26 


4.  Construct  the  data  frequency  table  for  the  following  data  set. 

{1,5,2,3,5,1,4,4,4,3,2,5,1,3,2,1,1,1,2}


ANSWERS 


1.  {31,32,32,32,32,32,33,33,33,33,33,33,34,34,34,34,35,35}. 

3.

x   22  23  24  25  26  27
f    3   1   3   1   2   1

Descriptive Statistics


As  described  in  Chapter  1  "Introduction",  statistics  naturally  divides  into  two 
branches,  descriptive  statistics  and  inferential  statistics.  Our  main  interest  is  in 
inferential  statistics,  as  shown  in  Figure  1.1  "The  Grand  Picture  of  Statistics"  in 
Chapter  1  "Introduction".  Nevertheless,  the  starting  point  for  dealing  with  a 
collection  of  data  is  to  organize,  display,  and  summarize  it  effectively.  These  are 
objectives  of  descriptive  statistics,  the  topic  of  this  chapter. 


Chapter  2  Descriptive  Statistics 


2.1  Three  Popular  Data  Displays 


LEARNING  OBJECTIVE 

1.  To  learn  to  interpret  the  meaning  of  three  graphical  representations  of 
sets  of  data:  stem  and  leaf  diagrams,  frequency  histograms,  and  relative 
frequency  histograms. 


A  well-known  adage  is  that  “a  picture  is  worth  a  thousand  words.”  This  saying 
proves  true  when  it  comes  to  presenting  statistical  information  in  a  data  set.  There 
are  many  effective  ways  to  present  data  graphically.  The  three  graphical  tools  that 
are  introduced  in  this  section  are  among  the  most  commonly  used  and  are  relevant 
to  the  subsequent  presentation  of  the  material  in  this  book. 

Stem  and  Leaf  Diagrams 

Suppose  30  students  in  a  statistics  class  took  a  test  and  made  the  following  scores: 


86  80  25  77  73  76  100  90  69  93
90  83  70  73  73  70   90  83  71  95
40  58  68  69  100 78   87  97  92  74

How  did  the  class  do  on  the  test?  A  quick  glance  at  the  set  of  30  numbers  does  not 
immediately  give  a  clear  answer.  However  the  data  set  may  be  reorganized  and 
rewritten  to  make  relevant  information  more  visible.  One  way  to  do  so  is  to 
construct  a  stem  and  leaf  diagram  as  shown  in  Figure  2.1  "Stem  and  Leaf  Diagram". 
The  numbers  in  the  tens  place,  from  2  through  9,  and  additionally  the  number  10, 
are  the  “stems,”  and  are  arranged  in  numerical  order  from  top  to  bottom  to  the  left 
of  a  vertical  line.  The  number  in  the  units  place  in  each  measurement  is  a  “leaf,” 
and  is  placed  in  a  row  to  the  right  of  the  corresponding  stem,  the  number  in  the 
tens  place  of  that  measurement.  Thus  the  three  leaves  9,  8,  and  9  in  the  row  headed 
with  the  stem  6  correspond  to  the  three  exam  scores  in  the  60s,  69  (in  the  first  row 
of  data),  68  (in  the  third  row),  and  69  (also  in  the  third  row).  The  display  is  made 
even  more  useful  for  some  purposes  by  rearranging  the  leaves  in  numerical  order, 
as  shown  in  Figure  2.2  "Ordered  Stem  and  Leaf  Diagram".  Either  way,  with  the  data 
reorganized  certain  information  of  interest  becomes  apparent  immediately.  There 
are  two  perfect  scores;  three  students  made  scores  under  60;  most  students  scored 
in the 70s, 80s and 90s; and the overall average is probably in the high 70s or low
80s. 


Figure 2.1 Stem and Leaf Diagram

 2 | 5
 3 |
 4 | 0
 5 | 8
 6 | 9 8 9
 7 | 7 3 6 0 3 3 0 1 8 4
 8 | 6 0 3 3 7
 9 | 0 3 0 0 5 7 2
10 | 0 0

Figure 2.2 Ordered Stem and Leaf Diagram

 2 | 5
 3 |
 4 | 0
 5 | 8
 6 | 8 9 9
 7 | 0 0 1 3 3 3 4 6 7 8
 8 | 0 3 3 6 7
 9 | 0 0 0 2 3 5 7
10 | 0 0

In this example the scores have a natural stem (the tens place) and leaf (the ones
place).  One  could  spread  the  diagram  out  by  splitting  each  tens  place  number  into 
lower  and  upper  categories.  For  example,  all  the  scores  in  the  80s  may  be 
represented  on  two  separate  stems,  lower  80s  and  upper  80s: 


8 | 0 3 3
8 | 6 7


The  definitions  of  stems  and  leaves  are  flexible  in  practice.  The  general  purpose  of  a 
stem  and  leaf  diagram  is  to  provide  a  quick  display  of  how  the  data  are  distributed 
across  the  range  of  their  values;  some  improvisation  could  be  necessary  to  obtain  a 
diagram  that  best  meets  that  goal. 


Note  that  all  of  the  original  data  can  be  recovered  from  the  stem  and  leaf  diagram. 
This  will  not  be  true  in  the  next  two  types  of  graphical  displays. 
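
The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name stem_and_leaf is hypothetical, and the list of scores is read back from the leaves of Figure 2.1, so its order follows the figure rather than the original data table.

```python
from collections import defaultdict

# Exam scores recovered from the stems and leaves of Figure 2.1 (assumption:
# the order within each stem follows the figure, not the original table).
scores = [25, 40, 58, 69, 68, 69,
          77, 73, 76, 70, 73, 73, 70, 71, 78, 74,
          86, 80, 83, 83, 87,
          90, 93, 90, 90, 95, 97, 92,
          100, 100]


def stem_and_leaf(data, ordered=True):
    """Return the rows of a stem and leaf diagram (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for value in data:
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    rows = []
    # Walk through every stem in the range so that empty stems still get a row.
    for stem in range(min(leaves), max(leaves) + 1):
        row = sorted(leaves[stem]) if ordered else leaves[stem]
        rows.append(f"{stem:>2} | " + " ".join(str(leaf) for leaf in row))
    return rows


diagram = stem_and_leaf(scores)
print("\n".join(diagram))
```

Calling stem_and_leaf(scores, ordered=False) keeps the leaves in the order in which the scores are listed, as in the unordered diagram.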

Frequency  Histograms 

The stem and leaf diagram is not practical for large data sets, so we need a different,
purely graphical way to represent data. A frequency histogram [1] is such a device.
We will illustrate it using the same data set from the previous subsection. For the 30
scores on the exam, it is natural to group the scores on the standard ten-point scale,
and count the number of scores in each group. Thus there are two 100s, seven
scores in the 90s, five in the 80s, and so on. We then construct the diagram shown in
Figure 2.3 "Frequency Histogram" by drawing for each group, or class, a vertical bar
whose length is the number of observations in that group. In our example, the bar
labeled 100 is 2 units long, the bar labeled 90 is 7 units long, and so on. While the
individual data values are lost, we know the number in each class. This number is
called the frequency [2] of the class, hence the name frequency histogram.


1.  A  graphical  device  showing 
how  data  are  distributed  across 
the  range  of  their  values  by 
collecting  them  into  classes 
and  indicating  the  number  of 
measurements  in  each  class. 

2.  Of  a  class  of  measurements,  the 
number  of  measurements  in 
the  data  set  that  are  in  the 
class. 



Chapter  2  Descriptive  Statistics 


Figure  2.3  Frequency  Histogram 


Score 


The same procedure can be applied to any collection of numerical data.
Observations are grouped into several classes and the frequency (the number of
observations) of each class is noted. These classes are arranged and indicated in
order on the horizontal axis (called the x-axis), and for each group a vertical bar,
whose length is the number of observations in that group, is drawn. The resulting
display is a frequency histogram for the data. The similarity between Figure 2.1 "Stem
and Leaf Diagram" and Figure 2.3 "Frequency Histogram" is apparent, particularly if you
imagine turning the stem and leaf diagram on its side by rotating it a quarter turn
counterclockwise.


In general, the definition of the classes in the frequency histogram is flexible. The
general purpose of a frequency histogram is very much the same as that of a stem
and leaf diagram: to provide a graphical display that gives a sense of how the data
are distributed across the range of values that appear. We will not discuss the process
of constructing a histogram from data since in actual practice it is done
automatically with statistical software or even handheld calculators.
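
As one illustration of how software might tally the classes, the following Python sketch groups the exam scores on the ten-point scale and prints a crude text histogram. The function name class_frequencies is hypothetical, and the scores are those read back from Figure 2.1; a real statistics package would draw the bars graphically.

```python
from collections import Counter

# Exam scores recovered from Figure 2.1 (an assumption of this sketch).
scores = [25, 40, 58, 69, 68, 69,
          77, 73, 76, 70, 73, 73, 70, 71, 78, 74,
          86, 80, 83, 83, 87,
          90, 93, 90, 90, 95, 97, 92,
          100, 100]


def class_frequencies(data, width=10):
    """Count the observations in each class [start, start + width)."""
    counts = Counter((value // width) * width for value in data)
    starts = range(min(counts), max(counts) + width, width)
    # Include empty classes (such as the 30s) so the bars stay in order.
    return {start: counts[start] for start in starts}


freq = class_frequencies(scores)
for start, count in freq.items():
    # One '#' per observation plays the role of the bar length.
    print(f"{start:>3} | {'#' * count} {count}")
```

The dictionary freq plays the role of the vertical axis: the class starting at 70 has frequency 10, the class starting at 90 has frequency 7, and so on.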


Relative Frequency Histograms

In our example of the exam scores in a statistics class, five students scored in the
80s. The number 5 is the frequency of the group labeled “80s.” Since there are 30
students in the entire statistics class, the proportion who scored in the 80s is 5/30.
The number 5/30, which could also be expressed as $0.1\overline{6} \approx 0.1667$, or as 16.67%, is
the relative frequency [3] of the group labeled “80s.” Every group (the 70s, the 80s,
and so on) has a relative frequency. We can thus construct a diagram by drawing for
each group, or class, a vertical bar whose length is the relative frequency of that
group. For example, the bar for the 80s will have length 5/30 unit, not 5 units. The
diagram is a relative frequency histogram [4] for the data, and is shown in Figure 2.4
"Relative Frequency Histogram". It is exactly the same as the frequency histogram
except that the vertical axis in the relative frequency histogram is not frequency
but relative frequency.


Figure 2.4 Relative Frequency Histogram (horizontal axis: Score)


3. Of a class of measurements, the proportion of all measurements in the data set
that are in the class.

4. A graphical device showing how data are distributed across the range of their
values by collecting them into classes and indicating the proportion of
measurements in each class.


The same procedure can be applied to any collection of numerical data. Classes are
selected, the relative frequency of each class is noted, the classes are arranged and
indicated in order on the horizontal axis, and for each class a vertical bar, whose
length is the relative frequency of the class, is drawn. The resulting display is a
relative frequency histogram for the data. A key point is that now if each vertical
bar has width 1 unit, then the total area of all the bars is 1 or 100%.
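
The step from frequencies to relative frequencies is a single division by n. The short Python sketch below illustrates it with exact fractions; the frequency table is the one implied by the exam scores (an assumption, since the table itself is not printed in the text), and the variable names are illustrative only.

```python
from fractions import Fraction

# Class frequencies for the exam scores, keyed by the start of each class
# (derived from the stem and leaf diagram; an assumption of this sketch).
frequency = {20: 1, 30: 0, 40: 1, 50: 1, 60: 3, 70: 10, 80: 5, 90: 7, 100: 2}

n = sum(frequency.values())  # sample size
relative = {start: Fraction(count, n) for start, count in frequency.items()}

for start, share in relative.items():
    print(f"{start:>3}: {share} = {float(share):.4f}")

# With bars of width 1, the total area is the sum of the relative frequencies.
total = sum(relative.values())
print("total:", total)
```

Using Fraction keeps 5/30 exact (it reduces to 1/6), so the total comes out as exactly 1 rather than a rounded decimal.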


Although the histograms in Figure 2.3 "Frequency Histogram" and Figure 2.4
"Relative Frequency Histogram" have the same appearance, the relative frequency
histogram is more important for us, and it will be relative frequency histograms
that will be used repeatedly to represent data in this text. To see why this is so,
reflect on what it is that you are actually seeing in the diagrams that quickly and
effectively communicates information to you about the data. It is the relative sizes of
the bars. The bar labeled “70s” in either figure takes up 1/3 of the total area of all
the bars, and although we may not think of this consciously, we perceive the
proportion 1/3 in the figures, indicating that a third of the grades were in the 70s.
The relative frequency histogram is important because the labeling on the vertical
axis reflects what is important visually: the relative sizes of the bars.


When the size n of a sample is small, only a few classes can be used in constructing a
relative frequency histogram. Such a histogram might look something like the one
in panel (a) of Figure 2.5 "Sample Size and Relative Frequency Histograms". If the
sample size n were increased, then more classes could be used in constructing a
relative frequency histogram and the vertical bars of the resulting histogram would
be finer, as indicated in panel (b) of Figure 2.5 "Sample Size and Relative Frequency
Histograms". For a very large sample the relative frequency histogram would look
very fine, like the one in panel (c) of Figure 2.5 "Sample Size and Relative Frequency
Histograms". If the sample size were to increase indefinitely, then the
corresponding relative frequency histogram would be so fine that it would look like
a smooth curve, such as the one in panel (d) of Figure 2.5 "Sample Size and Relative
Frequency Histograms".

Figure 2.5 Sample Size and Relative Frequency Histograms


(a) Small Sample


(c) Large Sample  (d) Very Large Sample


It  is  common  in  statistics  to  represent  a  population  or  a  very  large  data  set  by  a 
smooth  curve.  It  is  good  to  keep  in  mind  that  such  a  curve  is  actually  just  a  very  fine 
relative  frequency  histogram  in  which  the  exceedingly  narrow  vertical  bars  have 
disappeared.  Because  the  area  of  each  such  vertical  bar  is  the  proportion  of  the  data 
that  lies  in  the  interval  of  numbers  over  which  that  bar  stands,  this  means  that  for 
any  two  numbers  a  and  b,  the  proportion  of  the  data  that  lies  between  the  two 
numbers a and b is the area under the curve that is above the interval (a,b) on the
horizontal axis. This is the area shown in Figure 2.6 "A Very Fine Relative
Frequency  Histogram".  In  particular  the  total  area  under  the  curve  is  1,  or  100%. 
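
The idea that area measures proportion can be checked on an ordinary histogram before passing to a smooth curve. In the Python sketch below (function names hypothetical, scores read back from Figure 2.1), each bar's height is chosen so that height times width equals the relative frequency of its class; the proportion of data between two class boundaries a and b is then the total area of the bars standing over that interval.

```python
# Exam scores recovered from Figure 2.1 (an assumption of this sketch).
scores = [25, 40, 58, 69, 68, 69,
          77, 73, 76, 70, 73, 73, 70, 71, 78, 74,
          86, 80, 83, 83, 87,
          90, 93, 90, 90, 95, 97, 92,
          100, 100]


def bar_areas(data, width):
    """Return {class start: bar area}, where area = height * width."""
    n = len(data)
    # Each observation adds this much height to the bar of its class.
    height_per_observation = 1 / (n * width)
    areas = {}
    for value in data:
        start = (value // width) * width
        areas[start] = areas.get(start, 0.0) + height_per_observation * width
    return areas


def proportion_between(data, a, b, width=10):
    """Sum the bar areas over [a, b); a and b are assumed to be class boundaries."""
    return sum(area for start, area in bar_areas(data, width).items()
               if a <= start < b)


total_area = sum(bar_areas(scores, 10).values())
share_70_to_90 = proportion_between(scores, 70, 90)
print(round(total_area, 10), round(share_70_to_90, 10))
```

Here the bars over the interval from 70 to 90 cover half of the total area, matching the fact that 15 of the 30 scores are in the 70s or 80s.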

Figure 2.6 A Very Fine Relative Frequency Histogram

Shaded Area = Proportion of Data between a and b


KEY  TAKEAWAYS 


•  Graphical  representations  of  large  data  sets  provide  a  quick  overview  of 
the  nature  of  the  data. 

•  A  population  or  a  very  large  data  set  may  be  represented  by  a  smooth 
curve.  This  curve  is  a  very  fine  relative  frequency  histogram  in  which 
the  exceedingly  narrow  vertical  bars  have  been  omitted. 

•  When  a  curve  derived  from  a  relative  frequency  histogram  is  used  to 
describe  a  data  set,  the  proportion  of  data  with  values  between  two 
numbers  a  and  b  is  the  area  under  the  curve  between  a  and  b,  as 
illustrated  in  Figure  2.6  "A  Very  Fine  Relative  Frequency  Histogram". 

1. Describe one difference between a frequency histogram and a relative
frequency histogram.

2.  Describe  one  advantage  of  a  stem  and  leaf  diagram  over  a  frequency 
histogram. 

3.  Construct  a  stem  and  leaf  diagram,  a  frequency  histogram,  and  a  relative 
frequency  histogram  for  the  following  data  set.  For  the  histograms  use  classes 
51-60,  61-70,  and  so  on. 

69  92  68  77  80
70  85  88  85  96
93  75  76  82  100
53  70  70  82  85

4.  Construct  a  stem  and  leaf  diagram,  a  frequency  histogram,  and  a  relative 
frequency  histogram  for  the  following  data  set.  For  the  histograms  use  classes 
6.0-6.9,  7.0-7.9,  and  so  on. 

8.5  8.2  7.0  7.0  4.9
6.5  8.2  7.6  1.5  9.3
9.6  8.5  8.8  8.5  8.7
8.0  7.7  2.9  9.2  6.9

5. A data set contains n = 10 observations. The values x and their frequencies f are
summarized in the following data frequency table.

x   -1    0    1    2
f    3    4    2    1

Construct  a  frequency  histogram  and  a  relative  frequency  histogram  for  the 
data  set. 

6. A data set contains n = 20 observations. The values x and their frequencies f
are summarized in the following data frequency table.

x   -1    0    1    2
f    3    a    2    1

The  frequency  of  the  value  0  is  missing.  Find  a  and  then  sketch  a  frequency 
histogram  and  a  relative  frequency  histogram  for  the  data  set. 

7.  A  data  set  has  the  following  frequency  distribution  table: 


x   1   2   3   4
f   3   a   2   1

The number a is unknown. Can you construct a frequency histogram? If so,
construct it. If not, say why not.

8.  A  table  of  some  of  the  relative  frequencies  computed  from  a  data  set  is 


x      1     2     3     4
f/n    0.3   p     0.2   0.1

The  number  p  is  yet  to  be  computed.  Finish  the  table  and  construct  the  relative 
frequency  histogram  for  the  data  set. 


APPLICATIONS 


9.  The  IQ  scores  of  ten  students  randomly  selected  from  an  elementary  school  are 
given. 

108  100  99  125  87 

105  107  105  119  118 

Grouping  the  measures  in  the  80s,  the  90s,  and  so  on,  construct  a  stem  and  leaf 
diagram,  a  frequency  histogram,  and  a  relative  frequency  histogram. 

10.  The  IQ  scores  of  ten  students  randomly  selected  from  an  elementary  school  for 
academically  gifted  students  are  given. 

133  140  152  142  137 

145  160  138  139  138 

Grouping  the  measures  by  their  common  hundreds  and  tens  digits,  construct  a 
stem  and  leaf  diagram,  a  frequency  histogram,  and  a  relative  frequency 
histogram. 

11. During a one-day blood drive 300 people donated blood at a mobile donation
center. The blood types of these 300 donors are summarized in the table.

Blood Type   O     A     B     AB
Frequency    136   120   32    12

Construct  a  relative  frequency  histogram  for  the  data  set. 

12.  In  a  particular  kitchen  appliance  store  an  electric  automatic  rice  cooker  is  a 
popular  item.  The  weekly  sales  for  the  last  20  weeks  are  shown. 

20  15  14  14  18 
15  17  16  16  18 


15  19  12  13  9 
19  15  15  16  15 

Construct  a  relative  frequency  histogram  with  classes  6-10, 11-15,  and  16-20. 


ADDITIONAL  EXERCISES 


13.  Random  samples,  each  of  size  n  =  10,  were  taken  of  the  lengths  in  centimeters 
of  three  kinds  of  commercial  fish,  with  the  following  results: 


Sample 1:  108  100   99  125   87  105  107  105  119  118
Sample 2:  133  140  152  142  137  145  160  138  139  138
Sample 3:   82   60   83   82   82   74   79   82   80   80

Grouping  the  measures  by  their  common  hundreds  and  tens  digits,  construct  a 
stem  and  leaf  diagram,  a  frequency  histogram,  and  a  relative  frequency 
histogram  for  each  of  the  samples.  Compare  the  histograms  and  describe  any 
patterns  they  exhibit. 

14.  During  a  one-day  blood  drive  300  people  donated  blood  at  a  mobile  donation 
center.  The  blood  types  of  these  300  donors  are  summarized  below. 

Blood Type   O     A     B     AB
Frequency    136   120   32    12

Identify  the  blood  type  that  has  the  highest  relative  frequency  for  these  300 
people.  Can  you  conclude  that  the  blood  type  you  identified  is  also  most 
common  for  all  people  in  the  population  at  large?  Explain. 


15.  In  a  particular  kitchen  appliance  store,  the  weekly  sales  of  an  electric 
automatic  rice  cooker  for  the  last  20  weeks  are  as  follows. 

20  15  14  14  18 
15  17  16  16  18 


15  19  12  13  9 
19  15  15  16  15 

In  retail  sales,  too  large  an  inventory  ties  up  capital,  while  too  small  an 
inventory  costs  lost  sales  and  customer  satisfaction.  Using  the  relative 
frequency  histogram  for  these  data,  find  approximately  how  many  rice  cookers 
must  be  in  stock  at  the  beginning  of  each  week  if 

a.  the  store  is  not  to  run  out  of  stock  by  the  end  of  a  week  for  more  than  15% 
of  the  weeks;  and 

b.  the  store  is  not  to  run  out  of  stock  by  the  end  of  a  week  for  more  than  5% 
of  the  weeks. 

ANSWERS


1.  The  vertical  scale  on  one  is  the  frequencies  and  on  the  other  is  the  relative 
frequencies. 


3.

 5 | 3
 6 | 8 9
 7 | 0 0 0 5 6 7
 8 | 0 2 2 5 5 5 8
 9 | 2 3 6
10 | 0

Frequency and relative frequency histograms are similarly generated.

5. Noting that n = 10, the relative frequency table is:

x      -1     0     1     2
f/n    0.3   0.4   0.2   0.1

7.  Since  n  is  unknown,  a  is  unknown,  so  the  histogram  cannot  be  constructed. 

9.

 8 | 7
 9 | 9
10 | 0 5 5 7 8
11 | 8 9
12 | 5


Frequency and relative frequency histograms are similarly generated.

11. Noting n = 300, the relative frequency table is therefore:

Blood Type   O        A     B        AB
f/n          0.4533   0.4   0.1067   0.04

A  relative  frequency  histogram  is  then  generated. 

13. The stem and leaf diagrams are listed for Samples 1, 2, and 3 in that order.

 6 |
 7 |
 8 | 7
 9 | 9
10 | 0 5 5 7 8
11 | 8 9
12 | 5
13 |
14 |
15 |
16 |

 6 |
 7 |
 8 |
 9 |
10 |
11 |
12 |
13 | 3 7 8 8 9
14 | 0 2 5
15 | 2
16 | 0

 6 | 0
 7 | 4 9
 8 | 0 0 2 2 2 2 3
 9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
The frequency tables are given below in the same order.

Length   80-89   90-99   100-109   110-119   120-129
f        1       1       5         2         1

Length   130-139   140-149   150-159   160-169
f        5         3         1         1

Length   60-69   70-79   80-89
f        1       2       7

The relative frequency tables are given below in the same order.

Length   80-89   90-99   100-109   110-119   120-129
f/n      0.1     0.1     0.5       0.2       0.1

Length   130-139   140-149   150-159   160-169
f/n      0.5       0.3       0.1       0.1

Length   60-69   70-79   80-89
f/n      0.1     0.2     0.7

15.

a. 19.

b. 20.

2.2 Measures of Central Location


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  the  “center”  of  a  data  set. 

2. To learn the meaning of each of three measures of the center of a data
set (the mean, the median, and the mode) and how to compute each
one.


This  section  could  be  titled  “three  kinds  of  averages  of  a  data  set.”  Any  kind  of 
“average”  is  meant  to  be  an  answer  to  the  question  “Where  do  the  data  center?”  It 
is  thus  a  measure  of  the  central  location  of  the  data  set.  We  will  see  that  the  nature 
of  the  data  set,  as  indicated  by  a  relative  frequency  histogram,  will  determine  what 
constitutes  a  good  answer.  Different  shapes  of  the  histogram  call  for  different 
measures  of  central  location. 

The  Mean 

The first measure of central location is the usual “average” that is familiar to
everyone. In the formula in the following definition we introduce the standard
summation notation $\Sigma$, where $\Sigma$ is the capital Greek letter sigma. In general, the
notation $\Sigma$ followed by a second mathematical symbol means to add up all the
values that the second symbol can take in the context of the problem. Here is an
example to illustrate this.

EXAMPLE 1


Find $\sum x$, $\sum x^2$, and $\sum (x-1)^2$ for the data set

1  3  4


Solution:

\[
\begin{aligned}
\sum x &= 1 + 3 + 4 = 8 \\
\sum x^2 &= 1^2 + 3^2 + 4^2 = 1 + 9 + 16 = 26 \\
\sum (x-1)^2 &= (1-1)^2 + (3-1)^2 + (4-1)^2 = 0^2 + 2^2 + 3^2 = 13
\end{aligned}
\]

In  the  definition  we  follow  the  convention  of  using  lowercase  n  to  denote  the 
number  of  measurements  in  a  sample,  which  is  called  the  sample  size. 


Definition


The sample mean [5] of a set of n sample data is the number $\bar{x}$ defined by the formula

\[ \bar{x} = \frac{\sum x}{n} \]


5. The familiar average of a
sample data set.


EXAMPLE 2

Find the mean of the sample data

2  -1  0  2

Solution:

\[ \bar{x} = \frac{\sum x}{n} = \frac{2 + (-1) + 0 + 2}{4} = \frac{3}{4} = 0.75 \]


EXAMPLE  3 


A  random  sample  of  ten  students  is  taken  from  the  student  body  of  a  college 
and  their  GPAs  are  recorded  as  follows. 


1.90  3.00  2.53  3.71  2.12  1.76  2.71  1.39  4.00  3.33 


Find  the  sample  mean. 


Solution: 


\[ \bar{x} = \frac{\sum x}{n} = \frac{1.90 + 3.00 + 2.53 + 3.71 + 2.12 + 1.76 + 2.71 + 1.39 + 4.00 + 3.33}{10} = \frac{26.45}{10} = 2.645 \]

EXAMPLE 4


A random sample of 19 women beyond child-bearing age gave the following
data, where x is the number of children and f is the frequency of that value,
the number of times it occurred in the data set.

x   0   1   2   3   4
f   3   6   6   3   1

Find the sample mean.

Solution:


In  this  example  the  data  are  presented  by  means  of  a  data  frequency  table, 
introduced  in  Chapter  1  "Introduction".  Each  number  in  the  first  line  of  the 
table  is  a  number  that  appears  in  the  data  set;  the  number  below  it  is  how 
many  times  it  occurs.  Thus  the  value  0  is  observed  three  times,  that  is,  three 
of  the  measurements  in  the  data  set  are  0,  the  value  1  is  observed  six  times, 
and  so  on.  In  the  context  of  the  problem  this  means  that  three  women  in  the 
sample  have  had  no  children,  six  have  had  exactly  one  child,  and  so  on.  The 
explicit  list  of  all  the  observations  in  this  data  set  is  therefore 

0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 4


The  sample  size  can  be  read  directly  from  the  table,  without  first  listing  the 
entire  data  set,  as  the  sum  of  the  frequencies: 

$n = 3 + 6 + 6 + 3 + 1 = 19$. The sample mean can be computed
directly from the table as well:

\[ \bar{x} = \frac{\sum x}{n} = \frac{0 \times 3 + 1 \times 6 + 2 \times 6 + 3 \times 3 + 4 \times 1}{19} = \frac{31}{19} \approx 1.6316 \]
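
The same table-based calculation can be sketched in Python. The dictionary below maps each value x to its frequency f as in the table of this example; the variable names are illustrative assumptions.

```python
# Data frequency table of this example: value x -> frequency f
table = {0: 3, 1: 6, 2: 6, 3: 3, 4: 1}

n = sum(table.values())                         # sample size: sum of the frequencies
total = sum(x * f for x, f in table.items())    # each value counted f times
mean = total / n

print(n, total, round(mean, 4))
```

Multiplying each value by its frequency gives the same total as adding up the explicit list of 19 observations, without ever writing the list out.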


In the examples above the data sets were described as samples. Therefore the
means were sample means, denoted by $\bar{x}$. If the data come from a census, so that
there is a measurement for every element of the population, then the mean is
calculated by exactly the same process of summing all the measurements and
dividing by how many of them there are, but it is now the population mean and is
denoted by $\mu$, the lowercase Greek letter mu.

Definition

The population mean [6] of a set of N population data is the number $\mu$ defined by the
formula

\[ \mu = \frac{\sum x}{N} \]

The  mean  of  two  numbers  is  the  number  that  is  halfway  between  them.  For 
example,  the  average  of  the  numbers  5  and  17  is  (5  +  17)/2  =  11,  which  is  6  units 
above  5  and  6  units  below  17.  In  this  sense  the  average  11  is  the  “center”  of  the  data 
set  {5,17}.  For  larger  data  sets  the  mean  can  similarly  be  regarded  as  the  “center”  of 
the  data. 

The  Median 

To  see  why  another  concept  of  average  is  needed,  consider  the  following  situation. 
Suppose  we  are  interested  in  the  average  yearly  income  of  employees  at  a  large 
corporation.  We  take  a  random  sample  of  seven  employees,  obtaining  the  sample 
data  (rounded  to  the  nearest  hundred  dollars,  and  expressed  in  thousands  of 
dollars). 


24.8  22.8  24.6  192.5  25.2  18.5  23.7 


6.  The  familiar  average  of  a 
population  data  set. 


The mean (rounded to one decimal place) is $\bar{x} = 47.4$, but the statement “the
average  income  of  employees  at  this  corporation  is  $47,400”  is  surely  misleading.  It 
is  approximately  twice  what  six  of  the  seven  employees  in  the  sample  make  and  is 
nowhere  near  what  any  of  them  makes.  It  is  easy  to  see  what  went  wrong:  the 
presence  of  the  one  executive  in  the  sample,  whose  salary  is  so  large  compared  to 
everyone  else’s,  caused  the  numerator  in  the  formula  for  the  sample  mean  to  be  far 
too  large,  pulling  the  mean  far  to  the  right  of  where  we  think  that  the  average 
“ought”  to  be,  namely  around  $24,000  or  $25,000.  The  number  192.5  in  our  data  set 
is  called  an  outlier,  a  number  that  is  far  removed  from  most  or  all  of  the  remaining 
measurements.  Many  times  an  outlier  is  the  result  of  some  sort  of  error,  but  not 
always,  as  is  the  case  here.  We  would  get  a  better  measure  of  the  “center”  of  the 
data  if  we  were  to  arrange  the  data  in  numerical  order, 

18.5  22.8  23.7  24.6  24.8  25.2  192.5 
then select the middle number in the list, in this case 24.6. The result is called the
median of the data set, and has the property that roughly half of the measurements
are larger than it is, and roughly half are smaller. In this sense it locates the center
of the data. If there are an even number of measurements in the data set, then there
will be two middle elements when all are lined up in order, so we take the mean of
the middle two as the median. Thus we have the following definition.


Definition 

The sample median [7] $\tilde{x}$ of a set of sample data for which there are an odd number of
measurements is the middle measurement when the data are arranged in numerical
order. The sample median $\tilde{x}$ of a set of sample data for which there are an even
number of measurements is the mean of the two middle measurements when the data
are arranged in numerical order.
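
As an illustration, Python's standard statistics module follows this same definition of the median, so the income sample discussed above can be checked directly. The variable names are illustrative; the data are the seven incomes (in thousands of dollars) from the text.

```python
import statistics

# Yearly incomes of the seven sampled employees, in thousands of dollars
incomes = [24.8, 22.8, 24.6, 192.5, 25.2, 18.5, 23.7]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier 192.5
median_income = statistics.median(incomes)  # middle value of the sorted list

print(round(mean_income, 1), median_income)

# With an even number of measurements the two middle values are averaged.
even_case = statistics.median([2, -1, 0, 2])
print(even_case)
```

statistics.median sorts the data internally, so the list does not need to be arranged in numerical order beforehand.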


The  population  median  is  defined  in  a  similar  way,  but  we  will  not  have  occasion  to 
refer  to  it  again  in  this  text. 


The median is a value that divides the observations in a data set so that 50% of the
data are on its left and the other 50% on its right. In accordance with Figure 2.6 "A
Very Fine Relative Frequency Histogram", therefore, in the curve that represents
the distribution of the data, a vertical line drawn at the median divides the area in
two, area 0.5 (50% of the total area 1) to the left and area 0.5 (50% of the total area 1)
to the right, as shown in Figure 2.7 "The Median". In our income example the
median,  $24,600,  clearly  gave  a  much  better  measure  of  the  middle  of  the  data  set 
than  did  the  mean  $47,400.  This  is  typical  for  situations  in  which  the  distribution  is 
skewed.  (Skewness  and  symmetry  of  distributions  are  discussed  at  the  end  of  this 
subsection.) 


7.  The  middle  value  when  data 
are  listed  in  numerical  order. 

Figure 2.7 The Median


EXAMPLE  5 


Compute  the  sample  median  for  the  data  of  Note  2.11  "Example  2". 
Solution: 

The data in numerical order are -1, 0, 2, 2. The two middle measurements
are 0 and 2, so $\tilde{x} = (0 + 2)/2 = 1$.


EXAMPLE 6


Compute  the  sample  median  for  the  data  of  Note  2.12  "Example  3". 

Solution: 

The  data  in  numerical  order  are 

1.39  1.76  1.90  2.12  2.53  2.71  3.00  3.33  3.71  4.00 

The number of observations is ten, which is even, so there are two middle
measurements, the fifth and sixth, which are 2.53 and 2.71. Therefore the
median of these data is $\tilde{x} = (2.53 + 2.71)/2 = 2.62$.


EXAMPLE  7 


Compute  the  sample  median  for  the  data  of  Note  2.13  "Example  4". 

Solution: 

The data in numerical order are

0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 4

The number of observations is 19, which is odd, so there is one middle
measurement, the tenth. Since the tenth measurement is 2, the median is
$\tilde{x} = 2$.

It is important to note that we could have computed the median without
first explicitly listing all the observations in the data set. We already saw in
Note 2.13 "Example 4" how to find the number of observations directly from
the frequencies listed in the table: $n = 3 + 6 + 6 + 3 + 1 = 19$. As
just above we figure out that the median is the tenth observation. The
second line of the table in Note 2.13 "Example 4" shows that when the data
are listed in order there will be three 0s followed by six 1s, so the tenth
observation is a 2. The median is therefore 2.
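
The counting argument above, which locates the middle observation by running totals of the frequencies, can be written as a short Python sketch. The function name median_from_table is hypothetical, and the approach shown is only one way to organize the bookkeeping.

```python
def median_from_table(table):
    """Median of data given as {value: frequency}, without listing the data."""
    n = sum(table.values())
    # 1-based positions of the middle observation(s) in the ordered list.
    if n % 2 == 1:
        middle = [(n + 1) // 2]
    else:
        middle = [n // 2, n // 2 + 1]
    found = []
    cumulative = 0
    for value in sorted(table):
        cumulative += table[value]  # observations seen up to this value
        while len(found) < len(middle) and middle[len(found)] <= cumulative:
            found.append(value)
    return sum(found) / len(found)


# Table of Example 4: three 0s, six 1s, six 2s, three 3s, one 4
children = {0: 3, 1: 6, 2: 6, 3: 3, 4: 1}
median_children = median_from_table(children)
print(median_children)
```

For the table of Example 4 the running totals are 3, 9, 15, and so on, so the tenth observation is reached while counting the 2s, in agreement with the solution above.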

The relationship between the mean and the median for several common shapes of
distributions is shown in Figure 2.8 "Skewness of Relative Frequency Histograms".
The distributions in panels (a) and (b) are said to be symmetric because of the
symmetry that they exhibit. The distributions in the remaining two panels are said
to be skewed. In each distribution we have drawn a vertical line that divides the area
under the curve in half, which in accordance with Figure 2.7 "The Median" is
located at the median. The following facts are true in general:


a.  When  the  distribution  is  symmetric,  as  in  panels  (a)  and  (b)  of  Figure 
2.8  "Skewness  of  Relative  Frequency  Histograms",  the  mean  and  the 
median  are  equal. 

b.  When  the  distribution  is  as  shown  in  panel  (c)  of  Figure  2.8  "Skewness 
of  Relative  Frequency  Histograms",  it  is  said  to  be  skewed  right.  The 
mean  has  been  pulled  to  the  right  of  the  median  by  the  long  “right  tail” 
of  the  distribution,  the  few  relatively  large  data  values. 

c.  When  the  distribution  is  as  shown  in  panel  (d)  of  Figure  2.8  "Skewness 
of  Relative  Frequency  Histograms",  it  is  said  to  be  skewed  left.  The  mean 
has  been  pulled  to  the  left  of  the  median  by  the  long  “left  tail”  of  the 
distribution,  the  few  relatively  small  data  values. 
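
These three cases can be seen in miniature with small made-up data sets. The Python sketch below uses three hypothetical samples (they do not come from the text) that share the median 3 but differ in their tails; comparing the mean with the median in this way is only a rough indicator of skewness.

```python
import statistics

# Three hypothetical samples, chosen only to illustrate the three cases.
samples = {
    "symmetric": [1, 2, 3, 4, 5],
    "long right tail": [1, 2, 3, 4, 15],
    "long left tail": [-9, 2, 3, 4, 5],
}

summary = {}
for name, data in samples.items():
    mean = statistics.mean(data)
    median = statistics.median(data)
    summary[name] = (mean, median)
    print(f"{name:>15}: mean = {mean}, median = {median}")
```

Replacing the largest value 5 by 15 moves the mean to the right of the median, while replacing the smallest value 1 by -9 moves it to the left; the median does not change in either case.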


Figure 2.8 Skewness of Relative Frequency Histograms


(a) $\bar{x} = \tilde{x}$   (b) $\bar{x} = \tilde{x}$   (c) $\bar{x} > \tilde{x}$   (d) $\bar{x} < \tilde{x}$

The Mode

Perhaps  you  have  heard  a  statement  like  “The  average  number  of  automobiles 
owned  by  households  in  the  United  States  is  1.37,”  and  have  been  amused  at  the 
thought  of  a  fraction  of  an  automobile  sitting  in  a  driveway.  In  such  a  context  the 
following  measure  for  central  location  might  make  more  sense. 


Definition 

The sample mode [8] of a set of sample data is the most frequently occurring value.


The  population  mode  is  defined  in  a  similar  way,  but  we  will  not  have  occasion  to 
refer  to  it  again  in  this  text. 


On  a  relative  frequency  histogram,  the  highest  point  of  the  histogram  corresponds 
to  the  mode  of  the  data  set.  Figure  2.9  "Mode"  illustrates  the  mode. 


Figure  2.9  Mode 


8.  The  most  frequent  value  in  a 
data  set. 

For any data set there is always exactly one mean and exactly one median. This
need not be true of the mode; several different values could occur with the highest
frequency, as we will see. It could even happen that every value occurs with the
same frequency, in which case the concept of the mode does not make much sense.


EXAMPLE  8 


Find  the  mode  of  the  following  data  set. 


-1  0  2  0


Solution: 

The  value  0  is  most  frequently  observed  and  therefore  the  mode  is  0. 


EXAMPLE  9 


Compute the sample mode for the data of Note 2.13 "Example 4".

Solution:

The two most frequently observed values in the data set are 1 and 2.
Therefore the mode is a set of two values: {1, 2}.


The  mode  is  a  measure  of  central  location  since  most  real-life  data  sets  have  more 
observations  near  the  center  of  the  data  range  and  fewer  observations  on  the  lower 
and  upper  ends.  The  value  with  the  highest  frequency  is  often  in  the  middle  of  the 
data  range. 
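
The definition of the mode translates directly into a short computation. The following is a minimal sketch in Python, not part of the original text: the function name `sample_modes` and the two small lists are illustrative assumptions. It returns every value that attains the highest frequency, so a data set with several modes yields a list with several entries.

```python
from collections import Counter


def sample_modes(data):
    """Return all values that occur with the highest frequency, sorted."""
    counts = Counter(data)            # value -> number of occurrences
    highest = max(counts.values())    # the largest frequency observed
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical illustrative data sets
print(sample_modes([-1, 0, 2, 0]))     # [0]
print(sample_modes([1, 1, 2, 2, 3]))   # [1, 2]
```

If every value occurs equally often, the function returns all of the distinct values, which is the situation in which the mode conveys little information.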


KEY  TAKEAWAY 


The  mean,  the  median,  and  the  mode  each  answer  the  question  “Where  is 
the  center  of  the  data  set?”  The  nature  of  the  data  set,  as  indicated  by  a 
relative  frequency  histogram,  determines  which  one  gives  the  best  answer. 
2.2 Measures of Central Location


1. For the sample data set {1, 2, 6} find

a. $\sum x$

b. $\sum x^2$

c. $\sum (x - 3)$

d. $\sum (x - 3)^2$


2. For the sample data set {-1, 0, 1, 4} find

a. $\sum x$

b. $\sum x^2$

c. $\sum (x - 1)$

d. $\sum (x - 1)^2$


3. Find the mean, the median, and the mode for the sample

1 2 3 4

4. Find the mean, the median, and the mode for the sample

3 3 4 4

5. Find the mean, the median, and the mode for the sample

2 1 2 7

6. Find the mean, the median, and the mode for the sample

-1 0 1 4 1 1


7. Find the mean, the median, and the mode for the sample data represented by
the table

x    1   2   7
f    1   2   1

8. Find the mean, the median, and the mode for the sample data represented by
the table

x   -1   0   1   4
f    1   1   3   1

9. Create a sample data set of size n = 3 for which the mean $\bar{x}$ is greater than the
median $\tilde{x}$.

10. Create a sample data set of size n = 3 for which the mean $\bar{x}$ is less than the
median $\tilde{x}$.

11. Create a sample data set of size n = 4 for which the mean $\bar{x}$, the median $\tilde{x}$, and
the mode are all identical.

12. Create a data set of size n = 4 for which the median $\tilde{x}$ and the mode are
identical but the mean $\bar{x}$ is different.


APPLICATIONS 


13.  Find  the  mean  and  the  median  for  the  LDL  cholesterol  level  in  a  sample  of  ten 
heart  patients. 

132  162  133  145  148 

139  147  160  150  153 

14. Find the mean and the median for the LDL cholesterol level in a sample of ten
heart  patients  on  a  special  diet. 

127  152  138  110  152 

113  131  148  135  158 

15.  Find  the  mean,  the  median,  and  the  mode  for  the  number  of  vehicles  owned  in 
a  survey  of  52  households. 


x    0    1    2    3    4    5    6    7
f    2   12   15   11    6    3    1    2

16.  The  number  of  passengers  in  each  of  120  randomly  observed  vehicles  during 
morning  rush  hour  was  recorded,  with  the  following  results. 


x    1    2    3    4    5
f   84   29    3    3    1

Find  the  mean,  the  median,  and  the  mode  of  this  data  set. 


17.  Twenty-five  1-lb  boxes  of  16d  nails  were  randomly  selected  and  the  number  of 
nails  in  each  box  was  counted,  with  the  following  results. 


x   47   48   49   50   51
f    1    3   18    2    1

Find  the  mean,  the  median,  and  the  mode  of  this  data  set. 
ADDITIONAL EXERCISES


18.  Five  laboratory  mice  with  thymus  leukemia  are  observed  for  a  predetermined 
period  of  500  days.  After  500  days,  four  mice  have  died  but  the  fifth  one 
survives.  The  recorded  survival  times  for  the  five  mice  are 

493  421  222  378  500* 

where  500*  indicates  that  the  fifth  mouse  survived  for  at  least  500  days  but 
the  survival  time  (i.e.,  the  exact  value  of  the  observation)  is  unknown. 

a. Can you find the sample mean for the data set? If so, find it. If not, why
not?

b. Can you find the sample median for the data set? If so, find it. If not, why
not?

19.  Five  laboratory  mice  with  thymus  leukemia  are  observed  for  a  predetermined 
period  of  500  days.  After  450  days,  three  mice  have  died,  and  one  of  the 
remaining  mice  is  sacrificed  for  analysis.  By  the  end  of  the  observational 
period,  the  last  remaining  mouse  still  survives.  The  recorded  survival  times  for 
the  five  mice  are 

222  421  378  450*  500* 

where  *  indicates  that  the  mouse  survived  for  at  least  the  given  number  of 
days  but  the  exact  value  of  the  observation  is  unknown. 

a. Can you find the sample mean for the data set? If so, find it. If not, explain
why not.

b. Can you find the sample median for the data set? If so, find it. If not,
explain why not.

20.  A  player  keeps  track  of  all  the  rolls  of  a  pair  of  dice  when  playing  a  board  game 
and  obtains  the  following  data. 


x    2    3    4    5    6    7
f   10   29   40   56   68   77

x    8    9   10   11   12
f   67   55   39   28   11

Find  the  mean,  the  median,  and  the  mode. 

21.  Cordelia  records  her  daily  commute  time  to  work  each  day,  to  the  nearest 
minute,  for  two  months,  and  obtains  the  following  data. 
x   26   27   28   29   30   31   32
f    3    4   16   12    6    2    1

a.  Based  on  the  frequencies,  do  you  expect  the  mean  and  the  median  to  be 
about  the  same  or  markedly  different,  and  why? 

b.  Compute  the  mean,  the  median,  and  the  mode. 


22.  An  ordered  stem  and  leaf  diagram  gives  the  scores  of  71  students  on  an  exam. 


10 | 0 0
 9 | 1 1 1 1 2 3
 8 | 0 1 1 2 2 3 4 5 7 8 8 9
 7 | 0 0 0 1 1 2 4 4 5 6 6 6 7 7 7 8 8 9
 6 | 0 1 2 2 2 3 4 4 5 7 7 7 7 8 8
 5 | 0 2 3 3 4 4 6 7 7 8 9
 4 | 2 5 6 8 8
 3 | 9 9


a.  Based  on  the  shape  of  the  display,  do  you  expect  the  mean  and  the  median 
to  be  about  the  same  or  markedly  different,  and  why? 

b.  Compute  the  mean,  the  median,  and  the  mode. 


23.  A  man  tosses  a  coin  repeatedly  until  it  lands  heads  and  records  the  number  of 
tosses  required.  (For  example,  if  it  lands  heads  on  the  first  toss  he  records  a  1; 
if  it  lands  tails  on  the  first  two  tosses  and  heads  on  the  third  he  records  a  3.) 
The  data  are  shown. 


x     1     2     3     4     5     6     7     8     9    10
f   384   208    98    56    28    12     8     2     3     1

a.  Find  the  mean  of  the  data. 

b.  Find  the  median  of  the  data. 


24. a. Construct a data set consisting of ten numbers, all but one of which is
above average, where the average is the mean.

b. Is it possible to construct a data set as in part (a) when the average is the
median? Explain.

25.  Show  that  no  matter  what  kind  of  average  is  used  (mean,  median,  or  mode)  it  is 
impossible  for  all  members  of  a  data  set  to  be  above  average. 
26. a. Twenty sacks of grain weigh a total of 1,003 lb. What is the mean weight
per  sack? 

b.  Can  the  median  weight  per  sack  be  calculated  based  on  the  information 
given?  If  not,  construct  two  data  sets  with  the  same  total  but  different 
medians. 

27.  Begin  with  the  following  set  of  data,  call  it  Data  Set  I. 

5  -2  6  14  -3  0  1  4  3  2  5 

a.  Compute  the  mean,  median,  and  mode. 

b.  Form  a  new  data  set,  Data  Set  II,  by  adding  3  to  each  number  in  Data  Set  I. 
Calculate  the  mean,  median,  and  mode  of  Data  Set  II. 

c.  Form  a  new  data  set,  Data  Set  III,  by  subtracting  6  from  each  number  in 
Data  Set  I.  Calculate  the  mean,  median,  and  mode  of  Data  Set  III. 

d.  Comparing  the  answers  to  parts  (a),  (b),  and  (c),  can  you  guess  the  pattern? 
State  the  general  principle  that  you  expect  to  be  true. 


LARGE  DATA  SET  EXERCISES 


28.  Large  Data  Set  1  lists  the  SAT  scores  and  GPAs  of  1,000  students. 
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

a.  Compute  the  mean  and  median  of  the  1,000  SAT  scores. 

b.  Compute  the  mean  and  median  of  the  1,000  GPAs. 

29.  Large  Data  Set  1  lists  the  SAT  scores  of  1,000  students. 
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

a. Regard the data as arising from a census of all students at a high school, in
which the SAT score of every student was measured. Compute the
population mean $\mu$.

b. Regard the first 25 observations as a random sample drawn from this
population. Compute the sample mean $\bar{x}$ and compare it to $\mu$.

c. Regard the next 25 observations as a random sample drawn from this
population. Compute the sample mean $\bar{x}$ and compare it to $\mu$.

30.  Large  Data  Set  1  lists  the  GPAs  of  1,000  students. 
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

a. Regard the data as arising from a census of all freshmen at a small college
at the end of their first academic year of college study, in which the GPA of
every such person was measured. Compute the population mean $\mu$.

b. Regard the first 25 observations as a random sample drawn from this
population. Compute the sample mean $\bar{x}$ and compare it to $\mu$.

c. Regard the next 25 observations as a random sample drawn from this
population. Compute the sample mean $\bar{x}$ and compare it to $\mu$.

31. Large Data Sets 7, 7A, and 7B list the survival times in days of 140 laboratory
mice with thymic leukemia from onset to death.

http://www.gone.2012books.lardbucket.org/sites/all/files/data7.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7A.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7B.xls

a.  Compute  the  mean  and  median  survival  time  for  all  mice,  without  regard 
to  gender. 

b.  Compute  the  mean  and  median  survival  time  for  the  65  male  mice 
(separately  recorded  in  Large  Data  Set  7A). 

c.  Compute  the  mean  and  median  survival  time  for  the  75  female  mice 
(separately  recorded  in  Large  Data  Set  7B). 
ANSWERS


1. a. 9.

b. 41.

c. 0.

d. 14.

3. $\bar{x} = 2.5$, $\tilde{x} = 2.5$, mode = {1, 2, 3, 4}.

5. $\bar{x} = 3$, $\tilde{x} = 2$, mode = 2.

7. $\bar{x} = 3$, $\tilde{x} = 2$, mode = 2.

9. {0, 0, 3}.

11. {0, 1, 1, 2}.

13. $\bar{x} = 146.9$, $\tilde{x} = 147.5$

15. $\bar{x} = 2.6$, $\tilde{x} = 2$, mode = 2

17. $\bar{x} = 48.96$, $\tilde{x} = 49$, mode = 49

19. a. No, the survival times of the fourth and fifth mice are unknown.

b. Yes, $\tilde{x} = 421$.

21. $\bar{x} = 28.55$, $\tilde{x} = 28$, mode = 28

23. $\bar{x} = 2.05$, $\tilde{x} = 2$, mode = 1

25. Mean: $n x_{\min} \le \sum x$, so dividing by $n$ yields $x_{\min} \le \bar{x}$, so the minimum value
is not above average. Median: the middle measurement, or average of the two
middle measurements, $\tilde{x}$, is at least as large as $x_{\min}$, so the minimum value is
not above average. Mode: the mode is one of the measurements, and is not
greater than itself.

27. a. $\bar{x} = 3.18$, $\tilde{x} = 3$, mode = 5.

b. $\bar{x} = 6.18$, $\tilde{x} = 6$, mode = 8.

c. $\bar{x} = -2.81$, $\tilde{x} = -3$, mode = -1.

d. If a number is added to every measurement in a data set, then the mean,
median, and mode all change by that number.

29. a. $\mu = 1528.74$

b. $\bar{x} = 1502.8$

c. $\bar{x} = 1535.2$

31. a. $\bar{x} = 553.4286$ and $\tilde{x} = 552.5$

b. $\bar{x} = 665.9692$ and $\tilde{x} = 667$

c. $\bar{x} = 455.8933$ and $\tilde{x} = 448$
2.3 Measures of Variability


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  the  variability  of  a  data  set. 

2.  To  learn  how  to  compute  three  measures  of  the  variability  of  a  data  set: 
the  range,  the  variance,  and  the  standard  deviation. 


Look  at  the  two  data  sets  in  Table  2.1  "Two  Data  Sets"  and  the  graphical 
representation  of  each,  called  a  dot  plot,  in  Figure  2.10  "Dot  Plots  of  Data  Sets". 


Table  2.1  Two  Data  Sets 


Data Set I:    40   38   42   40   39   39   43   40   39   40
Data Set II:   46   37   40   33   42   36   40   47   34   45


Figure 2.10 Dot Plots of Data Sets

[Figure: two dot plots drawn on the same horizontal scale, running from 32 to 50. (a) Set I; (b) Set II]

The  two  sets  of  ten  measurements  each  center  at  the  same  value:  they  both  have 
mean,  median,  and  mode  40.  Nevertheless  a  glance  at  the  figure  shows  that  they  are 
markedly  different.  In  Data  Set  I  the  measurements  vary  only  slightly  from  the 
center,  while  for  Data  Set  II  the  measurements  vary  greatly.  Just  as  we  have 
attached  numbers  to  a  data  set  to  locate  its  center,  we  now  wish  to  associate  to  each 
data  set  numbers  that  measure  quantitatively  how  the  data  either  scatter  away 
from the center or cluster close to it. These new quantities are called measures of
variability,  and  we  will  discuss  three  of  them. 

The  Range 

The  first  measure  of  variability  that  we  discuss  is  the  simplest. 

Definition 

The range9 of a data set is the number $R$ defined by the formula

$$R = x_{\max} - x_{\min}$$

where $x_{\max}$ is the largest measurement in the data set and $x_{\min}$ is the smallest.


EXAMPLE  10 


Find the range of each data set in Table 2.1 "Two Data Sets".

Solution: 

For  Data  Set  I  the  maximum  is  43  and  the  minimum  is  38,  so  the  range  is 

R  =  43  -  38  =  5. 

For  Data  Set  II  the  maximum  is  47  and  the  minimum  is  33,  so  the  range  is 

R = 47 - 33 = 14.


The  range  is  a  measure  of  variability  because  it  indicates  the  size  of  the  interval 
over  which  the  data  points  are  distributed.  A  smaller  range  indicates  less  variability 
(less  dispersion)  among  the  data,  whereas  a  larger  range  indicates  the  opposite. 
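
As a quick check of the range computation, here is a minimal Python sketch. It is not part of the original text; the function name `data_range` is an illustrative assumption, and the two lists are the values of Data Set I and Data Set II from Table 2.1.

```python
def data_range(data):
    """Range R = largest measurement minus smallest measurement."""
    return max(data) - min(data)


set_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]   # Data Set I
set_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]   # Data Set II

print(data_range(set_1))   # 5
print(data_range(set_2))   # 14
```

Because the range uses only the two extreme values, it ignores how the remaining measurements are spread between them.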

The  Variance  and  the  Standard  Deviation 

9. The variability of a data set as
measured by the number
$R = x_{\max} - x_{\min}$.


The  other  two  measures  of  variability  that  we  will  consider  are  more  elaborate  and 
also  depend  on  whether  the  data  set  is  just  a  sample  drawn  from  a  much  larger 
population  or  is  the  whole  population  itself  (that  is,  a  census). 


Definition


The sample variance of a set of n sample data is the number $s^2$ defined by the
formula

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

which by algebra is equivalent to the formula

$$s^2 = \frac{\sum x^2 - \frac{1}{n} \left( \sum x \right)^2}{n - 1}$$

The sample standard deviation10 of a set of n sample data is the square root of the
sample variance, hence is the number $s$ given by the formulas

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - \frac{1}{n} \left( \sum x \right)^2}{n - 1}}$$


Although  the  first  formula  in  each  case  looks  less  complicated  than  the  second,  the 
latter  is  easier  to  use  in  hand  computations,  and  is  called  a  shortcut  formula. 
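
The algebra behind the shortcut formula is brief. The following derivation is a sketch added for illustration; it uses only the fact that the sample mean is $\bar{x} = \frac{1}{n} \sum x$, and shows that the two numerators in the definition agree.

```latex
\begin{align*}
\sum (x - \bar{x})^2
  &= \sum x^2 - 2\bar{x} \sum x + n\bar{x}^2 \\
  &= \sum x^2 - \frac{2}{n}\Bigl(\sum x\Bigr)^2 + \frac{1}{n}\Bigl(\sum x\Bigr)^2
     && \text{since } \bar{x} = \tfrac{1}{n} \sum x \\
  &= \sum x^2 - \frac{1}{n}\Bigl(\sum x\Bigr)^2
\end{align*}
```

Dividing both sides by $n - 1$ gives the two equivalent expressions for the sample variance.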


10. The variability of sample data
as measured by the number
$\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.


EXAMPLE 11


Find the sample variance and the sample standard deviation of Data Set II in
Table 2.1 "Two Data Sets".


Solution: 

To use the defining formula (the first formula) in the definition we first
compute for each observation $x$ its deviation $x - \bar{x}$ from the sample mean.
Since the mean of the data is $\bar{x} = 40$, we obtain the ten numbers displayed
in the second line of the supplied table.

$x$               46   37   40   33   42   36   40   47   34   45
$x - \bar{x}$      6   -3    0   -7    2   -4    0    7   -6    5

Then

$$\sum (x - \bar{x})^2 = 6^2 + (-3)^2 + 0^2 + (-7)^2 + 2^2 + (-4)^2 + 0^2 + 7^2 + (-6)^2 + 5^2 = 224$$

so

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{224}{9} = 24.\overline{8}$$

and

$$s = \sqrt{24.\overline{8}} \approx 4.99$$


The student is encouraged to compute the ten deviations for Data Set I and verify
that their squares add up to 20, so that the sample variance and standard deviation
of Data Set I are the much smaller numbers $s^2 = 20/9 = 2.\overline{2}$ and
$s = \sqrt{20/9} \approx 1.49$.
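
The hand computation can be cross-checked numerically. The sketch below is an illustration rather than part of the original text: the function names are assumptions, and the lists are the two data sets of Table 2.1. It evaluates the sample variance with both the defining formula and the shortcut formula.

```python
import math


def sample_variance_defining(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def sample_variance_shortcut(data):
    """(sum of x^2 minus (sum of x)^2 / n), divided by n - 1."""
    n = len(data)
    return (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)


set_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]   # Data Set I
set_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]   # Data Set II

for name, data in (("Set I", set_1), ("Set II", set_2)):
    s2 = sample_variance_defining(data)
    # Both formulas should agree up to floating-point rounding.
    assert math.isclose(s2, sample_variance_shortcut(data))
    print(name, round(s2, 2), round(math.sqrt(s2), 2))
# Set I 2.22 1.49
# Set II 24.89 4.99
```

The shortcut formula avoids computing each deviation, which is why it is convenient by hand. In floating-point arithmetic it can lose accuracy when the values are large relative to their spread, so the defining formula is often the safer choice in software.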
EXAMPLE 12


Find  the  sample  variance  and  the  sample  standard  deviation  of  the  ten  GPAs 
in  Note  2.12  "Example  3"  in  Section  2.2  "Measures  of  Central  Location". 

1.90  3.00  2.53  3.71  2.12  1.76  2.71  1.39  4.00  3.33 

Solution: 


Since

$$\sum x = 1.90 + 3.00 + 2.53 + 3.71 + 2.12 + 1.76 + 2.71 + 1.39 + 4.00 + 3.33 = 26.45$$

and

$$\sum x^2 = 1.90^2 + 3.00^2 + 2.53^2 + 3.71^2 + 2.12^2 + 1.76^2 + 2.71^2 + 1.39^2 + 4.00^2 + 3.33^2 = 76.7321$$

the shortcut formula gives

$$s^2 = \frac{\sum x^2 - \frac{1}{n} \left( \sum x \right)^2}{n - 1} = \frac{76.7321 - \frac{(26.45)^2}{10}}{10 - 1} = \frac{6.77185}{9} \approx .752427$$

and

$$s = \sqrt{.752427} \approx .867$$
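
For readers with access to software, the same quantities can be obtained from Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 denominator. This is a cross-check sketch, not part of the original text; the list holds the ten GPAs of this example.

```python
import statistics

gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

s2 = statistics.variance(gpas)   # sample variance (divides by n - 1)
s = statistics.stdev(gpas)       # sample standard deviation

print(round(s2, 4))   # 0.7524
print(round(s, 3))    # 0.867
```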


The  sample  variance  has  different  units  from  the  data.  For  example,  if  the  units  in 
the  data  set  were  inches,  the  new  units  would  be  inches  squared,  or  square  inches. 
It  is  thus  primarily  of  theoretical  importance  and  will  not  be  considered  further  in 
this  text,  except  in  passing. 


If the data set comprises the whole population, then the population standard
deviation, denoted $\sigma$ (the lower case Greek letter sigma), and its square, the
population variance $\sigma^2$, are defined as follows.


Definition


The population variance and population standard deviation11 of a set of N
population data are the numbers $\sigma^2$ and $\sigma$ defined by the formulas

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{and} \quad \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$


Note  that  the  denominator  in  the  fraction  is  the  full  number  of  observations,  not 
that  number  reduced  by  one,  as  is  the  case  with  the  sample  standard  deviation. 
Since  most  data  sets  are  samples,  we  will  always  work  with  the  sample  standard 
deviation  and  variance. 
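
The effect of the two denominators can be seen side by side. The sketch below is illustrative only: it takes the ten values of Data Set II and, purely as an assumption for the comparison, treats them once as a sample and once as a whole population. It uses the standard `statistics` functions `variance`/`stdev` (denominator n - 1) and `pvariance`/`pstdev` (denominator N).

```python
import statistics

data = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]   # Data Set II

# Treated as a sample: sum of squared deviations divided by n - 1 = 9
print(round(statistics.variance(data), 2))    # 24.89
print(round(statistics.stdev(data), 2))       # 4.99

# Treated (hypothetically) as a population: divided by N = 10
print(round(statistics.pvariance(data), 2))   # 22.4
print(round(statistics.pstdev(data), 2))      # 4.73
```

The population figures are slightly smaller because the same sum of squared deviations, 224, is divided by the larger number.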


Finally, in many real-life situations the most important statistical issues have to do
with comparing the means and standard deviations of two data sets. Figure 2.11
"Difference between Two Data Sets" illustrates how a difference in one or both of
the sample mean and the sample standard deviation is reflected in the appearance
of the data set as shown by the curves derived from the relative frequency
histograms built using the data.


11. The variability of population
data as measured by the
number $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$.
Figure 2.11 Difference between Two Data Sets


(a)  Two  Identical  Sets  (b)  Locations  Differ 


(c)  Variabilities  Differ 


(d)  Locations  and  Variabilities  Differ 


KEY  TAKEAWAY 


The  range,  the  standard  deviation,  and  the  variance  each  give  a  quantitative 
answer  to  the  question  “How  variable  are  the  data?” 




1.  Find  the  range,  the  variance,  and  the  standard  deviation  for  the  following 
sample. 

1 2 3 4

2.  Find  the  range,  the  variance,  and  the  standard  deviation  for  the  following 
sample. 

2  -3  6  0  3  1 

3.  Find  the  range,  the  variance,  and  the  standard  deviation  for  the  following 
sample. 

2 1 2 7

4.  Find  the  range,  the  variance,  and  the  standard  deviation  for  the  following 
sample. 

-1 0 1 4 1 1

5.  Find  the  range,  the  variance,  and  the  standard  deviation  for  the  sample 
represented  by  the  data  frequency  table. 


x    1   2   7
f    1   2   1

6.  Find  the  range,  the  variance,  and  the  standard  deviation  for  the  sample 
represented  by  the  data  frequency  table. 


x   -1   0   1   4
f    1   1   3   1

APPLICATIONS 


7.  Find  the  range,  the  variance,  and  the  standard  deviation  for  the  sample  of  ten 
IQ  scores  randomly  selected  from  a  school  for  academically  gifted  students. 

132  162  133  145  148 
139  147  160  150  153 

8.  Find  the  range,  the  variance  and  the  standard  deviation  for  the  sample  of  ten 
IQ  scores  randomly  selected  from  a  school  for  academically  gifted  students. 
142 152 138 145 148
139  147  155  150  153 


ADDITIONAL  EXERCISES 


9.  Consider  the  data  set  represented  by  the  table 


x   26   27   28   29   30   31   32
f    3    4   16   12    6    2    1

a. Use the frequency table to find that $\sum x = 1256$ and $\sum x^2 = 35,926$.

b.  Use  the  information  in  part  (a)  to  compute  the  sample  mean  and  the 
sample  standard  deviation. 


10.  Find  the  sample  standard  deviation  for  the  data 


x     1     2     3     4     5
f   384   208    98    56    28

x     6     7     8     9    10
f    12     8     2     3     1


11.  A  random  sample  of  49  invoices  for  repairs  at  an  automotive  body  shop  is 
taken.  The  data  are  arrayed  in  the  stem  and  leaf  diagram  shown.  (Stems  are 
thousands  of  dollars,  leaves  are  hundreds,  so  that  for  example  the  largest 
observation  is  3,800.) 


3 | 5 6 8
3 | 0 0 1 1 2 4
2 | 5 6 6 7 7 8 8 9 9
2 | 0 0 0 0 1 2 2 4
1 | 5 5 5 6 6 7 7 7 8 8 9
1 | 0 0 1 3 4 4 4
0 | 5 6 8 8
0 | 4

For these data, $\sum x = 101,100$ and $\sum x^2 = 244,830,000$.


a.  Compute  the  mean,  median,  and  mode. 

b.  Compute  the  range. 




c.  Compute  the  sample  standard  deviation. 

12.  What  must  be  true  of  a  data  set  if  its  standard  deviation  is  0? 

13.  A  data  set  consisting  of  25  measurements  has  standard  deviation  0.  One  of  the 
measurements  has  value  17.  What  are  the  other  24  measurements? 

14.  Create  a  sample  data  set  of  size  n  =  3  for  which  the  range  is  0  and  the  sample 
mean  is  2. 

15.  Create  a  sample  data  set  of  size  n  =  3  for  which  the  sample  variance  is  0  and  the 
sample  mean  is  1. 

16. The sample {-1, 0, 1} has mean $\bar{x} = 0$ and standard deviation $s = 1$. Create
a sample data set of size n = 3 for which $\bar{x} = 0$ and $s$ is greater than 1.

17. The sample {-1, 0, 1} has mean $\bar{x} = 0$ and standard deviation $s = 1$. Create
a sample data set of size n = 3 for which $\bar{x} = 0$ and the standard deviation $s$ is
less than 1.

18.  Begin  with  the  following  set  of  data,  call  it  Data  Set  I. 

5  -2  6  14  -3  0  1  4  3  2  5 

a.  Compute  the  sample  standard  deviation  of  Data  Set  I. 

b.  Form  a  new  data  set,  Data  Set  II,  by  adding  3  to  each  number  in  Data  Set  I. 
Calculate  the  sample  standard  deviation  of  Data  Set  II. 

c.  Form  a  new  data  set,  Data  Set  III,  by  subtracting  6  from  each  number  in 
Data  Set  I.  Calculate  the  sample  standard  deviation  of  Data  Set  III. 

d.  Comparing  the  answers  to  parts  (a),  (b),  and  (c),  can  you  guess  the  pattern? 
State  the  general  principle  that  you  expect  to  be  true. 


LARGE  DATA  SET  EXERCISES 


19.  Large  Data  Set  1  lists  the  SAT  scores  and  GPAs  of  1,000  students. 
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls

a.  Compute  the  range  and  sample  standard  deviation  of  the  1,000  SAT  scores. 

b.  Compute  the  range  and  sample  standard  deviation  of  the  1,000  GPAs. 

20.  Large  Data  Set  1  lists  the  SAT  scores  of  1,000  students. 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

a. Regard the data as arising from a census of all students at a high school, in
which the SAT score of every student was measured. Compute the
population range and population standard deviation $\sigma$.

b. Regard the first 25 observations as a random sample drawn from this
population. Compute the sample range and sample standard deviation $s$
and compare them to the population range and $\sigma$.

c. Regard the next 25 observations as a random sample drawn from this
population. Compute the sample range and sample standard deviation $s$
and compare them to the population range and $\sigma$.

21.  Large  Data  Set  1  lists  the  GPAs  of  1,000  students. 
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

a. Regard the data as arising from a census of all freshmen at a small college
at the end of their first academic year of college study, in which the GPA of
every such person was measured. Compute the population range and
population standard deviation $\sigma$.

b. Regard the first 25 observations as a random sample drawn from this
population. Compute the sample range and sample standard deviation $s$
and compare them to the population range and $\sigma$.

c. Regard the next 25 observations as a random sample drawn from this
population. Compute the sample range and sample standard deviation $s$
and compare them to the population range and $\sigma$.

22.  Large  Data  Sets  7,  7A,  and  7B  list  the  survival  times  in  days  of  140  laboratory 
mice  with  thymic  leukemia  from  onset  to  death. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data7.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7A.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7B.xls

a.  Compute  the  range  and  sample  standard  deviation  of  survival  time  for  all 
mice,  without  regard  to  gender. 

b.  Compute  the  range  and  sample  standard  deviation  of  survival  time  for  the 
65  male  mice  (separately  recorded  in  Large  Data  Set  7A). 

c.  Compute  the  range  and  sample  standard  deviation  of  survival  time  for  the 
75  female  mice  (separately  recorded  in  Large  Data  Set  7B).  Do  you  see  a 
difference  in  the  results  for  male  and  female  mice?  Does  it  appear  to  be 
significant? 
ANSWERS


1. $R = 3$, $s^2 = 1.7$, $s = 1.3$.

3. $R = 6$, $s^2 = 7.3$, $s = 2.7$.

5. $R = 6$, $s^2 = 7.3$, $s = 2.7$.

7. $R = 30$, $s^2 = 103.2$, $s = 10.2$.

9. $\bar{x} = 28.55$, $s = 1.3$.

11. a. $\bar{x} = 2063$, $\tilde{x} = 2000$, mode = 2000.

b. $R = 3400$.

c. $s = 869$.

13. All are 17.

15. {1, 1, 1}

17. One example is {-.5, 0, .5}.

19. a. $R = 1350$ and $s = 212.5455$

b. $R = 4.00$ and $s = 0.7407$

21. a. $R = 4.00$ and $\sigma = 0.740375$

b. $R = 3.04$ and $s = 0.808045$

c. $R = 2.49$ and $s = 0.657843$
2.4 Relative Position of Data


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  the  relative  position  of  an  element  of  a  data  set. 

2.  To  learn  the  meaning  of  each  of  two  measures,  the  percentile  rank  and 
the  z-score,  of  the  relative  position  of  a  measurement  and  how  to 
compute  each  one. 

3.  To  learn  the  meaning  of  the  three  quartiles  associated  to  a  data  set  and 
how  to  compute  them. 

4.  To  learn  the  meaning  of  the  five-number  summary  of  a  data  set,  how  to 
construct  the  box  plot  associated  to  it,  and  how  to  interpret  the  box 
plot. 


When you take an exam, what is often as important as your actual score on the
exam is the way your score compares to other students' performance. If you made a
70 but the average score (whether the mean, median, or mode) was 85, you did
relatively poorly. If you made a 70 but the average score was only 55 then you did
relatively  well.  In  general,  the  significance  of  one  observed  value  in  a  data  set 
strongly  depends  on  how  that  value  compares  to  the  other  observed  values  in  a  data 
set.  Therefore  we  wish  to  attach  to  each  observed  value  a  number  that  measures  its 
relative  position. 

Percentiles  and  Quartiles 

Anyone  who  has  taken  a  national  standardized  test  is  familiar  with  the  idea  of  being 
given  both  a  score  on  the  exam  and  a  “percentile  ranking”  of  that  score.  You  may 
be  told  that  your  score  was  625  and  that  it  is  the  85th  percentile.  The  first  number 
tells  how  you  actually  did  on  the  exam;  the  second  says  that  85%  of  the  scores  on 
the  exam  were  less  than  or  equal  to  your  score,  625. 


12.  The  measurement  x,  if  it  exists, 
such  that  P  percent  of  the  data 
are  less  than  or  equal  to  x. 

13.  Of  a  measurement  x,  the 
percentage  of  the  data  that  are 
less  than  or  equal  to  x. 


Definition 

Given  an  observed  value  x  in  a  data  set,  x  is  the  Pth  percentile12  of  the  data  if  the 
percentage  of  the  data  that  are  less  than  or  equal  to  x  is  P.  The  number  P  is  the 

percentile rank13 of x.
EXAMPLE 13


What  percentile  is  the  value  1.39  in  the  data  set  of  ten  GPAs  considered  in 
Note  2.12  "Example  3"  in  Section  2.2  "Measures  of  Central  Location"?  What 
percentile  is  the  value  3.33? 

Solution: 

The  data  written  in  increasing  order  are 

1.39  1.76  1.90  2.12  2.53  2.71  3.00  3.33  3.71  4.00 

The  only  data  value  that  is  less  than  or  equal  to  1.39  is  1.39  itself.  Since  1  is 
1/10  =  .10  or  10%  of  10,  the  value  1.39  is  the  10th  percentile.  Eight  data  values 
are less than or equal to 3.33. Since 8 is 8/10 = .80 or 80% of 10, the value 3.33
is  the  80th  percentile. 
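
The counting in this example can be written as a short function. The following Python sketch is an illustration, not part of the original text; the name `percentile_rank` is an assumption, and the list holds the ten GPAs of the example. It returns the percentage of the data that are less than or equal to a given value.

```python
def percentile_rank(data, x):
    """Percentage of the observations that are less than or equal to x."""
    at_or_below = sum(1 for value in data if value <= x)
    return 100 * at_or_below / len(data)


gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

print(percentile_rank(gpas, 1.39))   # 10.0
print(percentile_rank(gpas, 3.33))   # 80.0
```

The data need not be sorted first, since the function only counts how many observations do not exceed x.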


The  Pth  percentile  cuts  the  data  set  in  two  so  that  approximately  P%  of  the  data  lie 
below it and (100 - P)% of the data lie above it. In particular, the three percentiles
that  cut  the  data  into  fourths,  as  shown  in  Figure  2.12  "Data  Division  by  Quartiles", 
are  called  the  quartiles14.  The  following  simple  computational  definition  of  the 
three  quartiles  works  well  in  practice. 


14. Of a data set, the three
numbers $Q_1$, $Q_2$, $Q_3$ that
divide the data approximately
into fourths.
Definition

For any data set:

1. The second quartile $Q_2$ of the data set is its median.

2. Define two subsets:

   1. the lower set: all observations that are strictly less than $Q_2$;

   2. the upper set: all observations that are strictly greater than $Q_2$.

3. The first quartile $Q_1$ of the data set is the median of the lower set.

4. The third quartile $Q_3$ of the data set is the median of the upper set.
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EXAMPLE  14 


Find  the  quartiles  of  the  data  set  of  GPAs  of  Note  2.12  "Example  3"  in  Section 
2.2  "Measures  of  Central  Location". 


Solution: 

As  in  the  previous  example  we  first  list  the  data  in  numerical  order: 

1.39  1.76  1.90  2.12  2.53  2.71  3.00  3.33  3.71  4.00 

This  data  set  has  n  =  10  observations.  Since  10  is  an  even  number,  the  median 
is  the  mean  of  the  two  middle  observations: 

x̃ = (2.53 + 2.71) / 2 = 2.62. Thus the second quartile is
Q2 = 2.62. The lower and upper subsets are

Lower:  L  =  {1.39,1.76,1.90,2.12,2.53} 

Upper:  U  =  {2.71,3.00,3.33,3.71,4.00} 

Each has an odd number of elements, so the median of each is its middle
observation. Thus the first quartile is Q1 = 1.90, the median of L, and the
third quartile is Q3 = 3.33, the median of U.




EXAMPLE  15 


Adjoin  the  observation  3.88  to  the  data  set  of  the  previous  example  and  find 
the  quartiles  of  the  new  set  of  data. 

Solution: 

As  in  the  previous  example  we  first  list  the  data  in  numerical  order: 

1.39  1.76  1.90  2.12  2.53  2.71  3.00  3.33  3.71  3.88  4.00 

This data set has 11 observations. The second quartile is its median, the
middle value 2.71. Thus Q2 = 2.71. The lower and upper subsets are now

Lower:  L  =  {1.39,1.76,1.90,2.12,2.53} 

Upper:  U  =  {3.00,3.33,3.71,3.88,4.00} 

The lower set L has median 1.90, its middle value, so Q1 = 1.90. The
upper set U has median 3.71, its middle value, so Q3 = 3.71.
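
The computational definition of the quartiles translates directly into code. The following is a minimal sketch, assuming Python's standard statistics module; the function name quartiles is our own. Note that quantile functions in software libraries often use other conventions and can return slightly different values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the lower-set / upper-set definition."""
    q2 = median(data)
    lower = [v for v in data if v < q2]   # strictly less than Q2
    upper = [v for v in data if v > q2]   # strictly greater than Q2
    # median() raises StatisticsError if a subset is empty,
    # for example when every observation has the same value.
    return median(lower), q2, median(upper)


gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]

q1, q2, q3 = quartiles(gpas)             # ten observations
print(q1, round(q2, 2), q3)

q1, q2, q3 = quartiles(gpas + [3.88])    # eleven observations
print(q1, q2, q3)
```

The first call should reproduce Q1 = 1.90, Q2 = 2.62, Q3 = 3.33 and the second Q1 = 1.90, Q2 = 2.71, Q3 = 3.71. When the number of observations is odd the median itself belongs to neither subset, because both comparisons are strict.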


In addition to the three quartiles, the two extreme values, the minimum xmin and
the maximum xmax, are also useful in describing the entire data set. Together these
five numbers are called the five-number summary [15] of the data set:

{xmin, Q1, Q2, Q3, xmax}

The five-number summary is used to construct a box plot [16] as in Figure 2.13 "The
Box Plot". Each of the five numbers is represented by a vertical line segment, a box
is formed using the line segments at Q1 and Q3 as its two vertical sides, and two
horizontal line segments are extended from the vertical segments marking Q1 and
Q3 to the adjacent extreme values. (The two horizontal line segments are referred
to as “whiskers,” and the diagram is sometimes called a “box and whisker plot.”)

15. Of a data set, the list xmin, Q1, Q2, Q3, xmax.

16. For a data set, a diagram constructed using the five-number summary, as in
Figure 2.13 "The Box Plot", which graphically summarizes the distribution of the data.


We  caution  the  reader  that  there  are  other  types  of  box  plots  that  differ  somewhat 
from  the  ones  we  are  constructing,  although  all  are  based  on  the  three  quartiles. 

Figure 2.13 The Box Plot

Note that the distance from Q1 to Q3 is the length of the interval over which the
middle half of the data range. Thus it has the following special name.


Definition 

The interquartile range (IQR) [17] is the quantity

IQR = Q3 − Q1


17. Of a data set, the difference between the first and third quartiles.

EXAMPLE 16


Construct  a  box  plot  and  find  the  IQR  for  the  data  in  Note  2.44  "Example  14". 
Solution: 

From  our  work  in  Note  2.44  "Example  14"  we  know  that  the  five-number 
summary  is 

xmin = 1.39   Q1 = 1.90   Q2 = 2.62   Q3 = 3.33   xmax = 4.00

The box plot is

[Box plot: vertical segments at 1.39, 1.90, 2.62, 3.33, and 4.00, with the box spanning 1.90 to 3.33]

The interquartile range is IQR = 3.33 − 1.90 = 1.43.
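
As a check on this example, the five-number summary and the IQR can be computed together. This is a sketch only: the function name five_number_summary is hypothetical, and the quartile rule is the lower-set and upper-set definition of this section, repeated inside the function so that the block runs on its own.

```python
from statistics import median


def five_number_summary(data):
    """Return (x_min, Q1, Q2, Q3, x_max) for a list of numbers."""
    q2 = median(data)
    q1 = median([v for v in data if v < q2])  # median of the lower set
    q3 = median([v for v in data if v > q2])  # median of the upper set
    return min(data), q1, q2, q3, max(data)


gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]
x_min, q1, q2, q3, x_max = five_number_summary(gpas)
iqr = q3 - q1

# Values are rounded only for display, to hide floating-point noise.
print(x_min, q1, round(q2, 2), q3, x_max)
print("IQR =", round(iqr, 2))
```

These five values are exactly the positions of the vertical segments in the box plot, and the IQR is the width of the box.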

z-scores 

Another  way  to  locate  a  particular  observation  x  in  a  data  set  is  to  compute  its 
distance  from  the  mean  in  units  of  standard  deviation. 

Definition


The z-score [18] of an observation x is the number z given by the computational formula

z = (x − x̄) / s   or   z = (x − μ) / σ

according to whether the data set is a sample or is the entire population.


The formulas in the definition allow us to compute the z-score when x is known. If
the z-score is known then x can be recovered using the corresponding inverse
formulas

x = x̄ + sz   or   x = μ + σz

The z-score indicates how many standard deviations an individual observation x is
from the center of the data set, its mean. If z is negative then x is below average. If z
is 0 then x is equal to the average. If z is positive then x is above average. See Figure
2.14.


18. Of a measurement x, the distance of x from the mean in units of standard deviation.

Figure 2.14 x-Scale versus z-Score


EXAMPLE  17 


Find  the  z-scores  for  all  ten  observations  in  the  GPA  sample  data  in  Note  2.12 
"Example  3"  in  Section  2.2  "Measures  of  Central  Location". 

1.90  3.00  2.53  3.71  2.12  1.76  2.71  1.39  4.00  3.33 


Solution: 


For these data x̄ = 2.645 and s = 0.8674. The first observation x = 1.90 in
the data set has z-score

z = (1.90 − 2.645) / 0.8674 = −0.8589


which  means  that  x  =  1.90  is  0.8589  standard  deviations  below  the  sample 
mean.  The  second  observation  x  =  3.00  has  z-score 


z = (3.00 − 2.645) / 0.8674 = 0.4093


which  means  that  x  =  3.00  is  0.4093  standard  deviations  above  the  sample 
mean.  Repeating  the  process  for  the  remaining  observations  gives  the  full 
set  of  z-scores 


−0.86  0.41  −0.13  1.23  −0.61  −1.02  0.07  −1.45  1.56  0.79
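
The repeated arithmetic in this example is easy to automate. The sketch below assumes Python's standard statistics module, whose stdev function computes the sample standard deviation (dividing by n − 1); the variable names are our own.

```python
from statistics import mean, stdev

# The ten GPAs, in the order in which they are listed in this example.
gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

x_bar = mean(gpas)   # sample mean
s = stdev(gpas)      # sample standard deviation (divides by n - 1)

# z-score of each observation: distance from the mean in units of s.
z_scores = [(x - x_bar) / s for x in gpas]
print([round(z, 2) for z in z_scores])
```

For a data set that is an entire population, pstdev from the same module (which divides by n) would take the place of stdev.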

EXAMPLE 18


Suppose the mean and standard deviation of the GPAs of all currently
registered students at a college are μ = 2.70 and σ = 0.50. The z-scores of the
GPAs of two students, Antonio and Beatrice, are z = −0.62 and z = 1.28,
respectively. What are their GPAs?

Solution: 

Using  the  second  formula  right  after  the  definition  of  z-scores  we  compute 
the  GPAs  as 

Antonio: x = μ + zσ = 2.70 + (−0.62)(0.50) = 2.39
Beatrice: x = μ + zσ = 2.70 + (1.28)(0.50) = 3.34
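
The inverse formula can also be written as a one-line function. This is an illustrative sketch; the name value_from_z and its parameter names are assumptions, not notation from the text.

```python
def value_from_z(z, center, spread):
    """Recover x from its z-score: x = center + z * spread."""
    return center + z * spread


mu, sigma = 2.70, 0.50   # population mean and standard deviation

antonio = value_from_z(-0.62, mu, sigma)
beatrice = value_from_z(1.28, mu, sigma)

# Rounded to two decimal places, as GPAs are usually reported.
print(round(antonio, 2), round(beatrice, 2))
```

The same function serves for a sample if the sample mean and sample standard deviation are passed in place of μ and σ.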


KEY  TAKEAWAYS 


•  The  percentile  rank  and  z-score  of  a  measurement  indicate  its  relative 
position  with  regard  to  the  other  measurements  in  a  data  set. 

•  The  three  quartiles  divide  a  data  set  into  fourths. 

•  The  five-number  summary  and  its  associated  box  plot  summarize  the 
location  and  distribution  of  the  data. 

1. Consider the data set

69  92  68  77  80

93  75  76  82  100 

70  85  88  85  96 

53  70  70  82  85 

a.  Find  the  percentile  rank  of  82. 

b.  Find  the  percentile  rank  of  68. 

2.  Consider  the  data  set 

8.5  8.2  7.0  7.0  4.9 

9.6  8.5  8.8  8.5  8.7 

6.5  8.2  7.6  1.5  9.3 

8.0  7.7  2.9  9.2  6.9 

a.  Find  the  percentile  rank  of  6.5. 

b.  Find  the  percentile  rank  of  7.7. 


3.  Consider  the  data  set  represented  by  the  ordered  stem  and  leaf  diagram 


10 | 0 0
 9 | 1 1 1 1 2 3
 8 | 0 1 1 2 2 3 4 5 7 8 8 9
 7 | 0 0 0 1 1 2 4 4 5 6 6 6 7 7 7 8 8 9
 6 | 0 1 2 2 2 3 4 4 5 7 7 7 7 8 8
 5 | 0 2 3 3 4 4 6 7 7 8 9
 4 | 2 5 6 8 8
 3 | 9 9


a.  Find  the  percentile  rank  of  the  grade  75. 

b.  Find  the  percentile  rank  of  the  grade  57. 


4.  Is  the  90th  percentile  of  a  data  set  always  equal  to  90%?  Why  or  why  not? 

5. The 29th percentile in a large data set is 5.

a.  Approximately  what  percentage  of  the  observations  are  less  than  5? 

b.  Approximately  what  percentage  of  the  observations  are  greater  than  5? 

6.  The  54th  percentile  in  a  large  data  set  is  98.6. 

a.  Approximately  what  percentage  of  the  observations  are  less  than  98.6? 

b.  Approximately  what  percentage  of  the  observations  are  greater  than  98.6? 

7.  In  a  large  data  set  the  29th  percentile  is  5  and  the  79th  percentile  is  10. 
Approximately  what  percentage  of  observations  lie  between  5  and  10? 

8.  In  a  large  data  set  the  40th  percentile  is  125  and  the  82nd  percentile  is  158. 
Approximately  what  percentage  of  observations  lie  between  125  and  158? 

9.  Find  the  five-number  summary  and  the  IQR  and  sketch  the  box  plot  for  the 
sample  represented  by  the  stem  and  leaf  diagram  in  Figure  2.2  "Ordered  Stem 
and  Leaf  Diagram". 

10.  Find  the  five-number  summary  and  the  IQR  and  sketch  the  box  plot  for  the 
sample  explicitly  displayed  in  Note  2.20  "Example  7"  in  Section  2.2  "Measures 
of  Central  Location". 

11.  Find  the  five-number  summary  and  the  IQR  and  sketch  the  box  plot  for  the 
sample  represented  by  the  data  frequency  table 


x | 1  2  5  8  9
f | 5  2  3  6  4

12.  Find  the  five-number  summary  and  the  IQR  and  sketch  the  box  plot  for  the 
sample  represented  by  the  data  frequency  table 


x | [illegible]  [illegible]  −2  −1  0  1  3  4  5
f | 2  1  3  2  4  1  1  2  1

13.  Find  the  z-score  of  each  measurement  in  the  following  sample  data  set. 

−5  6  2  −1  0

14.  Find  the  z-score  of  each  measurement  in  the  following  sample  data  set. 

1.6  5.2  2.8  3.7  4.0 

15.  The  sample  with  data  frequency  table 


x | 1  2  7
f | 1  2  1

has mean x̄ = 3 and standard deviation s ≈ 2.71. Find the z-score for every
value in the sample.


16.  The  sample  with  data  frequency  table 


x | −1  0  1  4
f | 1  1  3  1

has mean x̄ = 1 and standard deviation s ≈ 1.67. Find the z-score for every
value in the sample.


17.  For  the  population 


0  0  2  2 


compute  each  of  the  following. 


a. The population mean μ.

b. The population variance σ².

c. The population standard deviation σ.

d.  The  z-score  for  every  value  in  the  population  data  set. 

18.  For  the  population 

0.5  2.1  4.4  1.0 


compute  each  of  the  following. 


a. The population mean μ.

b. The population variance σ².

c. The population standard deviation σ.

d.  The  z-score  for  every  value  in  the  population  data  set. 

19. A measurement x in a sample with mean x̄ = 10 and standard deviation s = 3
has z-score z = 2. Find x.

20. A measurement x in a sample with mean x̄ = 10 and standard deviation s = 3
has z-score z = −1. Find x.


21. A measurement x in a population with mean μ = 2.3 and standard deviation σ =
1.3 has z-score z = 2. Find x.


22. A measurement x in a sample with mean μ = 2.3 and standard deviation σ = 1.3
has z-score z = −1.2. Find x.

APPLICATIONS


23.  The  weekly  sales  for  the  last  20  weeks  in  a  kitchen  appliance  store  for  an 
electric  automatic  rice  cooker  are 


20  15  14  14  18  15  19  12  13  9

15  17  16  16  18  19  15  15  16  15

a.  Find  the  percentile  rank  of  15. 

b.  If  the  sample  accurately  reflects  the  population,  then  what  percentage  of 
weeks  would  an  inventory  of  15  rice  cookers  be  adequate? 

24.  The  table  shows  the  number  of  vehicles  owned  in  a  survey  of  52  households. 


x | 0  1   2   3   4  5  6  7
f | 2  12  15  11  6  3  1  2

a.  Find  the  percentile  rank  of  2. 

b.  If  the  sample  accurately  reflects  the  population,  then  what  percentage  of 
households  have  at  most  two  vehicles? 

25.  For  two  months  Cordelia  records  her  daily  commute  time  to  work  each  day  to 
the  nearest  minute  and  obtains  the  following  data: 


x | 26  27  28  29  30  31  32
f | 3   4   16  12  6   2   1

Cordelia  is  supposed  to  be  at  work  at  8:00  a.m.  but  refuses  to  leave  her  house 
before  7:30  a.m. 

a.  Find  the  percentile  rank  of  30,  the  time  she  has  to  get  to  work. 

b.  Assuming  that  the  sample  accurately  reflects  the  population  of  all  of 
Cordelia’s  commute  times,  use  your  answer  to  part  (a)  to  predict  the 
proportion  of  the  work  days  she  is  late  for  work. 

26.  The  mean  score  on  a  standardized  grammar  exam  is  49.6;  the  standard 
deviation  is  1.35.  Dromio  is  told  that  the  z-score  of  his  exam  score  is  -1.19. 

a.  Is  Dromio’s  score  above  average  or  below  average? 

b.  What  was  Dromio’s  actual  score  on  the  exam? 

27.  A  random  sample  of  49  invoices  for  repairs  at  an  automotive  body  shop  is 
taken.  The  data  are  arrayed  in  the  stem  and  leaf  diagram  shown.  (Stems  are 
thousands of dollars, leaves are hundreds, so that for example the largest
observation is 3,800.)

For these data, Σx = 101,100, Σx² = 244,830,000.


a.  Find  the  z-score  of  the  repair  that  cost  $1,100. 

b.  Find  the  z-score  of  the  repairs  that  cost  $2,700. 


28.  The  stem  and  leaf  diagram  shows  the  time  in  seconds  that  callers  to  a 
telephone-order  center  were  on  hold  before  their  call  was  taken. 


0 | 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3
0 | 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7 7
1 | 0 0 1 1 1 1 2 2 2 2 4 4
1 | 5 6 6 8 9
2 | 2 4
2 | 5
3 | 3

a. Find the quartiles.

b.  Give  the  five-number  summary  of  the  data. 

c.  Find  the  range  and  the  IQR. 


ADDITIONAL  EXERCISES 


29.  Consider  the  data  set  represented  by  the  ordered  stem  and  leaf  diagram 

a. Find the three quartiles.

b.  Give  the  five-number  summary  of  the  data. 

c.  Find  the  range  and  the  IQR. 


30.  For  the  following  stem  and  leaf  diagram  the  units  on  the  stems  are  thousands 
and  the  units  on  the  leaves  are  hundreds,  so  that  for  example  the  largest 
observation  is  3,800. 

a. Find the percentile rank of 800.

b.  Find  the  percentile  rank  of  3,200. 

31.  Find  the  five-number  summary  for  the  following  sample  data. 

32. Find the five-number summary for the following sample data.

33. For the following stem and leaf diagram the units on the stems are thousands
and  the  units  on  the  leaves  are  hundreds,  so  that  for  example  the  largest 
observation  is  3,800. 


3 | 5 6 8
3 | 0 0 1 1 2 4
2 | 5 6 6 7 7 8 8 9 9
2 | 0 0 0 0 1 2 2 4
1 | 5 5 5 6 6 7 7 7 8 8 9
1 | 0 0 1 3 4 4 4
0 | 5 6 8 8
0 | 4

a. Find the three quartiles.

b.  Find  the  IQR. 

c.  Give  the  five-number  summary  of  the  data. 


34. Determine whether the following statement is true. “In any data set, if an
observation x1 is greater than another observation x2, then the z-score of x1
is greater than the z-score of x2.”


35.  Emilia  and  Ferdinand  took  the  same  freshman  chemistry  course,  Emilia  in  the 
fall,  Ferdinand  in  the  spring.  Emilia  made  an  83  on  the  common  final  exam  that 
she  took,  on  which  the  mean  was  76  and  the  standard  deviation  8.  Ferdinand 
made  a  79  on  the  common  final  exam  that  he  took,  which  was  more  difficult, 
since  the  mean  was  65  and  the  standard  deviation  12.  The  one  who  has  a 
higher  z-score  did  relatively  better.  Was  it  Emilia  or  Ferdinand? 


36.  Refer  to  the  previous  exercise.  On  the  final  exam  in  the  same  course  the 

following  semester,  the  mean  is  68  and  the  standard  deviation  is  9.  What  grade 
on  the  exam  matches  Emilia’s  performance?  Ferdinand’s? 


37.  Rosencrantz  and  Guildenstern  are  on  a  weight-reducing  diet.  Rosencrantz,  who 
weighs  178  lb,  belongs  to  an  age  and  body-type  group  for  which  the  mean 
weight  is  145  lb  and  the  standard  deviation  is  15  lb.  Guildenstern,  who  weighs 
204  lb,  belongs  to  an  age  and  body-type  group  for  which  the  mean  weight  is 
165  lb  and  the  standard  deviation  is  20  lb.  Assuming  z-scores  are  good 
measures  for  comparison  in  this  context,  who  is  more  overweight  for  his  age 
and  body  type? 

LARGE DATA SET EXERCISES


38. Large Data Set 1 lists the SAT scores and GPAs of 1,000 students.

http://www.gone.2012books.lardbucket.org/sites/all/files/data1.xls

a.  Compute  the  three  quartiles  and  the  interquartile  range  of  the  1,000  SAT 
scores. 

b.  Compute  the  three  quartiles  and  the  interquartile  range  of  the  1,000  GPAs. 

39.  Large  Data  Set  10  records  the  scores  of  72  students  on  a  statistics  exam. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data10.xls

a.  Compute  the  five-number  summary  of  the  data. 

b.  Describe  in  words  the  performance  of  the  class  on  the  exam  in  the  light  of 
the  result  in  part  (a). 

40. Large Data Sets 3 and 3A list the heights of 174 customers entering a shoe
store.

http://www.gone.2012books.lardbucket.org/sites/all/files/data3.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data3A.xls

a.  Compute  the  five-number  summary  of  the  heights,  without  regard  to 
gender. 

b.  Compute  the  five-number  summary  of  the  heights  of  the  men  in  the 
sample. 

c.  Compute  the  five-number  summary  of  the  heights  of  the  women  in  the 
sample. 

41. Large Data Sets 7, 7A, and 7B list the survival times in days of 140 laboratory
mice with thymic leukemia from onset to death.

http://www.gone.2012books.lardbucket.org/sites/all/files/data7.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7A.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7B.xls

a.  Compute  the  three  quartiles  and  the  interquartile  range  of  the  survival 
times  for  all  mice,  without  regard  to  gender. 

b.  Compute  the  three  quartiles  and  the  interquartile  range  of  the  survival 
times  for  the  65  male  mice  (separately  recorded  in  Large  Data  Set  7A). 

c.  Compute  the  three  quartiles  and  the  interquartile  range  of  the  survival 
times  for  the  75  female  mice  (separately  recorded  in  Large  Data  Set  7B). 

ANSWERS


1.  a.  60. 

b.  10. 

3.  a.  59. 

b.  23. 

5.  a.  29. 

b.  71. 

7.  50%. 

9. xmin = 25, Q1 = 70, Q2 = 77.5, Q3 = 90, xmax = 100, IQR = 20.

11. xmin = 1, Q1 = 1.5, Q2 = 6.5, Q3 = 8, xmax = 9, IQR = 6.5.

13. −1.3, 1.39, 0.4, −0.35, −0.11.

15. z = −0.74 for x = 1, z = −0.37 for x = 2, z = 1.48 for x = 7.

17.  a.  1. 

b.  1. 

c.  1. 

d. z = −1 for x = 0, z = 1 for x = 2.

19.  16. 

21.  4.9. 

23.  a.  55. 

b.  55. 

25.  a.  93. 

b.  0.07. 

27.  a.  -1.11. 

b.  0.73. 

29. a. Q1 = 59, Q2 = 70, Q3 = 81.

b. xmin = 39, Q1 = 59, Q2 = 70, Q3 = 81, xmax = 100.

c. R = 61, IQR = 22.

31. xmin = 26, Q1 = 28, Q2 = 28, Q3 = 29, xmax = 32.

33. a. Q1 = 1450, Q2 = 2000, Q3 = 2800.

b. IQR = 1350.

c. xmin = 400, Q1 = 1450, Q2 = 2000, Q3 = 2800, xmax = 3800.

35. Emilia: z = 0.875, Ferdinand: z = 1.16.

37. Rosencrantz: z = 2.2, Guildenstern: z = 1.95. Rosencrantz is more overweight for
his age and body type.

39. a. xmin = 15, Q1 = 51, Q2 = 67, Q3 = 82, and xmax = 97.

b. The data set appears to be skewed to the left.

41. a. Q1 = 440, Q2 = 552.5, Q3 = 661, and IQR = 221.

b. Q1 = 641, Q2 = 667, Q3 = 700, and IQR = 59.

c. Q1 = 407, Q2 = 448, Q3 = 504, and IQR = 97.

2.5 The Empirical Rule and Chebyshev’s Theorem


LEARNING  OBJECTIVES 

1.  To  learn  what  the  value  of  the  standard  deviation  of  a  data  set  implies 
about  how  the  data  scatter  away  from  the  mean  as  described  by  the 
Empirical  Rule  and  Chebyshev’s  Theorem. 

2.  To  use  the  Empirical  Rule  and  Chebyshev’s  Theorem  to  draw  conclusions 
about  a  data  set. 


You  probably  have  a  good  intuitive  grasp  of  what  the  average  of  a  data  set  says 
about  that  data  set.  In  this  section  we  begin  to  learn  what  the  standard  deviation 
has  to  tell  us  about  the  nature  of  the  data  set. 

The  Empirical  Rule 

We start by examining a specific set of data. Table 2.2 "Heights of Men" shows the
heights in inches of 100 randomly selected adult men. A relative frequency
histogram for the data is shown in Figure 2.15 "Heights of Adult Men". The mean
and standard deviation of the data are, rounded to two decimal places, x̄ = 69.92
and s = 1.70. If we go through the data and count the number of observations that
are within one standard deviation of the mean, that is, that are between
69.92 − 1.70 = 68.22 and 69.92 + 1.70 = 71.62 inches, there are 69 of them. If
we count the number of observations that are within two standard deviations of the
mean, that is, that are between 69.92 − 2(1.70) = 66.52 and
69.92 + 2(1.70) = 73.32 inches, there are 95 of them. All of the measurements
are within three standard deviations of the mean, that is, between
69.92 − 3(1.70) = 64.82 and 69.92 + 3(1.70) = 75.02 inches. These tallies
are not coincidences, but are in agreement with the following result that has been
found to be widely applicable.


Table  2.2  Heights  of  Men 

Figure 2.15 Heights of Adult Men




The  Empirical  Rule 

If a data set has an approximately bell-shaped relative frequency histogram,
then (see Figure 2.16 "The Empirical Rule")

1. approximately 68% of the data lie within one standard deviation of
the mean, that is, in the interval with endpoints x̄ ± s for samples
and with endpoints μ ± σ for populations;

2. approximately 95% of the data lie within two standard deviations
of the mean, that is, in the interval with endpoints x̄ ± 2s for
samples and with endpoints μ ± 2σ for populations; and

3. approximately 99.7% of the data lie within three standard
deviations of the mean, that is, in the interval with endpoints
x̄ ± 3s for samples and with endpoints μ ± 3σ for populations.

Two key points in regard to the Empirical Rule are that the data distribution must
be  approximately  bell-shaped  and  that  the  percentages  are  only  approximately  true. 
The  Empirical  Rule  does  not  apply  to  data  sets  with  severely  asymmetric 
distributions,  and  the  actual  percentage  of  observations  in  any  of  the  intervals 
specified  by  the  rule  could  be  either  greater  or  less  than  those  given  in  the  rule.  We 
see  this  with  the  example  of  the  heights  of  the  men:  the  Empirical  Rule  suggested  68 
observations  between  68.22  and  71.62  inches  but  we  counted  69. 
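
The tallying described above can be carried out by a short program. The following is a sketch under stated assumptions: the function name fraction_within is our own, and because the 100 heights of Table 2.2 are not reproduced here, the data are simulated with Python's random.gauss using a mean of 69.92 and a standard deviation of 1.70. The simulated values are not the data of Table 2.2.

```python
import random
from statistics import mean, stdev


def fraction_within(data, k):
    """Fraction of observations within k standard deviations of the mean."""
    x_bar = mean(data)
    s = stdev(data)   # sample standard deviation
    inside = [x for x in data if x_bar - k * s <= x <= x_bar + k * s]
    return len(inside) / len(data)


# Simulated, roughly bell-shaped "heights" (NOT the data of Table 2.2).
random.seed(1)
heights = [random.gauss(69.92, 1.70) for _ in range(1000)]

for k in (1, 2, 3):
    print(k, fraction_within(heights, k))
```

For bell-shaped data the three printed fractions should be close to, but generally not exactly equal to, 0.68, 0.95, and 0.997.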

EXAMPLE 19


Heights  of  18-year-old  males  have  a  bell-shaped  distribution  with  mean  69.6 
inches  and  standard  deviation  1.4  inches. 

a.  About  what  proportion  of  all  such  men  are  between  68.2  and  71  inches 
tall? 

b.  What  interval  centered  on  the  mean  should  contain  about  95%  of  all 
such  men? 

Solution: 

A  sketch  of  the  distribution  of  heights  is  given  in  Figure  2.17  "Distribution  of 
Heights". 

a. Since the interval from 68.2 to 71.0 has endpoints x̄ − s and x̄ + s, by
the Empirical Rule about 68% of all 18-year-old males should have
heights in this range.

b. By the Empirical Rule the shortest such interval has endpoints
x̄ − 2s and x̄ + 2s. Since

x̄ − 2s = 69.6 − 2(1.4) = 66.8 and x̄ + 2s = 69.6 + 2(1.4) = 72.4,

the interval in question is the interval from 66.8 inches to 72.4
inches.

Figure 2.17 Distribution of Heights

EXAMPLE 20


Scores on IQ tests have a bell-shaped distribution with mean μ = 100 and
standard deviation σ = 10. Discuss what the Empirical Rule implies
concerning individuals with IQ scores of 110, 120, and 130.

Solution: 

A  sketch  of  the  IQ  distribution  is  given  in  Figure  2.18  "Distribution  of  IQ 
Scores".  The  Empirical  Rule  states  that 

1. approximately 68% of the IQ scores in the population lie between 90 and
110,

2.  approximately  95%  of  the  IQ  scores  in  the  population  lie  between  80  and 
120,  and 

3.  approximately  99.7%  of  the  IQ  scores  in  the  population  lie  between  70 
and  130. 

Figure 2.18 Distribution of IQ Scores


Since 68% of the IQ scores lie within the interval from 90 to 110, it must be
the case that 32% lie outside that interval. By symmetry approximately half
of that 32%, or 16% of all IQ scores, will lie above 110. If 16% lie above 110,
then 84% lie below. We conclude that the IQ score 110 is the 84th percentile.

The same analysis applies to the score 120. Since approximately 95% of all IQ
scores lie within the interval from 80 to 120, only 5% lie outside it, and half
of them, or 2.5% of all scores, are above 120. The IQ score 120 is thus higher
than 97.5% of all IQ scores, and is quite a high score.

By a similar argument, only 15/100 of 1% of all adults, or about one or two in
every thousand, would have an IQ score above 130. This fact makes the score
130 extremely high.
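
The symmetry argument used for the scores 110, 120, and 130 follows one pattern, sketched below. The dictionary and function names are illustrative assumptions; the percentages are the approximate values stated in the Empirical Rule, so the results are approximations as well.

```python
# Approximate percentages from the Empirical Rule, keyed by k
# (the number of standard deviations from the mean).
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def percent_below_upper_endpoint(k):
    """Approximate percentage of data below mean + k standard deviations."""
    outside = 100.0 - EMPIRICAL_RULE[k]   # percentage outside the interval
    upper_tail = outside / 2              # by symmetry, half lies above
    return 100.0 - upper_tail


mu, sigma = 100, 10   # IQ scores: mean 100, standard deviation 10
for k in (1, 2, 3):
    score = mu + k * sigma
    print(score, round(percent_below_upper_endpoint(k), 2))
```

The halving step relies on the distribution being symmetric, which is part of what "bell-shaped" means here. It would not be justified for a skewed data set.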


Chebyshev’s  Theorem 

The Empirical Rule does not apply to all data sets, only to those that are
bell-shaped, and even then is stated in terms of approximations. A result that applies to
every data set is known as Chebyshev’s Theorem.


Chebyshev’s  Theorem 

For  any  numerical  data  set, 

1. at least 3/4 of the data lie within two standard deviations of the
mean, that is, in the interval with endpoints x̄ ± 2s for samples
and with endpoints μ ± 2σ for populations;

2. at least 8/9 of the data lie within three standard deviations of the
mean, that is, in the interval with endpoints x̄ ± 3s for samples
and with endpoints μ ± 3σ for populations;

3. at least 1 − 1/k² of the data lie within k standard deviations of
the mean, that is, in the interval with endpoints x̄ ± ks for
samples and with endpoints μ ± kσ for populations, where k is any
positive whole number that is greater than 1.


Figure  2.19  "Chebyshev’s  Theorem"  gives  a  visual  illustration  of  Chebyshev’s 
Theorem. 

Figure 2.19 Chebyshev’s Theorem


It  is  important  to  pay  careful  attention  to  the  words  “at  least”  at  the  beginning  of 
each  of  the  three  parts.  The  theorem  gives  the  minimum  proportion  of  the  data 
which  must  lie  within  a  given  number  of  standard  deviations  of  the  mean;  the  true 
proportions  found  within  the  indicated  regions  could  be  greater  than  what  the 
theorem  guarantees. 

EXAMPLE 21


A sample of size n = 50 has mean x̄ = 28 and standard deviation s = 3.
Without knowing anything else about the sample, what can be said about the
number of observations that lie in the interval (22,34)? What can be said
about the number of observations that lie outside that interval?


Solution: 


The interval (22,34) is the one that is formed by adding and subtracting two
standard deviations from the mean. By Chebyshev’s Theorem, at least 3/4 of
the data are within this interval. Since 3/4 of 50 is 37.5, this means that at
least 37.5 observations are in the interval. But one cannot take a fractional
observation, so we conclude that at least 38 observations must lie inside the
interval (22,34).

If at least 3/4 of the observations are in the interval, then at most 1/4 of
them are outside it. Since 1/4 of 50 is 12.5, at most 12.5 observations are
outside the interval. Since again a fraction of an observation is impossible, at
most 12 observations lie outside the interval (22,34).
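
The rounding in this solution (up for the minimum inside the interval, down for the maximum outside it) can be captured in a small function. This is a sketch; the name chebyshev_counts is hypothetical, and exact fractions are used so that no floating-point rounding affects the result.

```python
from fractions import Fraction
from math import ceil, floor


def chebyshev_counts(n, k):
    """Return (minimum inside, maximum outside) for k standard deviations.

    Chebyshev's Theorem guarantees that at least 1 - 1/k^2 of the n
    observations lie within k standard deviations of the mean.
    """
    if k <= 1:
        raise ValueError("k must be a whole number greater than 1")
    proportion = 1 - Fraction(1, k * k)        # exact arithmetic
    min_inside = ceil(n * proportion)          # round up: no fractional observations
    max_outside = floor(n * (1 - proportion))  # round down for the same reason
    return min_inside, max_outside


print(chebyshev_counts(50, 2))   # (38, 12)
print(chebyshev_counts(50, 3))
```

The two counts always add up to n, since rounding the guaranteed minimum up is the same as rounding the permitted maximum down.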

EXAMPLE 22


The number of vehicles passing through a busy intersection between 8:00
a.m. and 10:00 a.m. was observed and recorded on every weekday morning
of the last year. The data set contains n = 251 numbers. The sample mean is
x̄ = 725 and the sample standard deviation is s = 25. Identify which of the
following statements must be true.

1.  On  approximately  95%  of  the  weekday  mornings  last  year  the  number  of 
vehicles  passing  through  the  intersection  from  8:00  a.m.  to  10:00  a.m. 
was  between  675  and  775. 

2.  On  at  least  75%  of  the  weekday  mornings  last  year  the  number  of 
vehicles  passing  through  the  intersection  from  8:00  a.m.  to  10:00  a.m. 
was  between  675  and  775. 

3.  On  at  least  189  weekday  mornings  last  year  the  number  of  vehicles 
passing  through  the  intersection  from  8:00  a.m.  to  10:00  a.m.  was 
between  675  and  775. 

4.  On  at  most  25%  of  the  weekday  mornings  last  year  the  number  of 
vehicles  passing  through  the  intersection  from  8:00  a.m.  to  10:00  a.m. 
was  either  less  than  675  or  greater  than  775. 

5.  On  at  most  12.5%  of  the  weekday  mornings  last  year  the  number  of 
vehicles  passing  through  the  intersection  from  8:00  a.m.  to  10:00  a.m. 
was  less  than  675. 

6. On at most 25% of the weekday mornings last year the number of
vehicles  passing  through  the  intersection  from  8:00  a.m.  to  10:00  a.m. 
was  less  than  675. 

Solution: 

1. Since it is not stated that the relative frequency histogram of the data is
bell-shaped, the Empirical Rule does not apply. Statement (1) is based on
the Empirical Rule and therefore it might not be correct.

2. Statement (2) is a direct application of part (1) of Chebyshev’s Theorem
because (x̄ - 2s, x̄ + 2s) = (675, 775). It must be correct.

3.  Statement  (3)  says  the  same  thing  as  statement  (2)  because  75%  of  251  is 
188.25,  so  the  minimum  whole  number  of  observations  in  this  interval  is 
189.  Thus  statement  (3)  is  definitely  correct. 

4.  Statement  (4)  says  the  same  thing  as  statement  (2)  but  in  different 
words,  and  therefore  is  definitely  correct. 

5.  Statement  (4),  which  is  definitely  correct,  states  that  at  most  25%  of  the 
time  either  fewer  than  675  or  more  than  775  vehicles  passed  through  the 
intersection.  Statement  (5)  says  that  half  of  that  25%  corresponds  to 
days of light traffic. This would be correct if the relative frequency
histogram of the data were known to be symmetric. But this is not
stated; perhaps all of the observations outside the interval (675,775) are
less than 675. Thus statement (5) might not be correct.

6.  Statement  (4)  is  definitely  correct  and  statement  (4)  implies  statement 
(6):  even  if  every  measurement  that  is  outside  the  interval  (675,775)  is 
less  than  675  (which  is  conceivable,  since  symmetry  is  not  known  to 
hold),  even  so  at  most  25%  of  all  observations  are  less  than  675.  Thus 
statement  (6)  must  definitely  be  correct. 
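
The reasoning behind statements (5) and (6) can be illustrated with a small made-up data set. The sketch below is an assumption-laden illustration, not the traffic data from the example: the values in `data` are hypothetical and were chosen to be deliberately lopsided, so that every observation outside the two-standard-deviation interval falls on the low side.

```python
import statistics

# Hypothetical, deliberately lopsided data (NOT the traffic counts from the example).
data = [600] * 3 + [700] * 17

mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
low, high = mean - 2 * s, mean + 2 * s

# Proportions of observations strictly below and strictly above the interval.
below = sum(x < low for x in data) / len(data)
above = sum(x > high for x in data) / len(data)

print(round(low, 1), round(high, 1))
print(below, above)
```

For these data 15% of the observations lie below the interval and none lie above it. The total outside the interval stays under the 25% allowed by Chebyshev's Theorem, as in statement (6), but the low side alone exceeds 12.5%, which is why a claim like statement (5) cannot be guaranteed without symmetry.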


KEY  TAKEAWAYS 


•  The  Empirical  Rule  is  an  approximation  that  applies  only  to  data  sets 
with  a  bell-shaped  relative  frequency  histogram.  It  estimates  the 
proportion  of  the  measurements  that  lie  within  one,  two,  and  three 
standard  deviations  of  the  mean. 

•  Chebyshev’s  Theorem  is  a  fact  that  applies  to  all  possible  data  sets.  It 
describes the minimum proportion of the measurements that must lie
within  one,  two,  or  more  standard  deviations  of  the  mean. 
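
The contrast between the two results can be tabulated in a few lines. This is a sketch under stated assumptions: the Empirical Rule percentages are the usual approximate values 68%, 95%, and 99.7%, the Chebyshev bound is taken to be 1 - 1/k^2, and the names `EMPIRICAL` and `chebyshev_min` are hypothetical.

```python
from fractions import Fraction

# Approximate proportions; valid only for bell-shaped data.
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}


def chebyshev_min(k):
    """Minimum proportion within k standard deviations, for any data set.

    For k = 1 the bound is 0, so the theorem says nothing useful there.
    """
    return 1 - Fraction(1, k * k)


for k in (1, 2, 3):
    print(k, EMPIRICAL[k], chebyshev_min(k))
```

The Empirical Rule gives larger figures because it assumes more about the shape of the data; Chebyshev's Theorem gives smaller but guaranteed ones.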
1. State the Empirical Rule.

2.  Describe  the  conditions  under  which  the  Empirical  Rule  may  be  applied. 

3.  State  Chebyshev’s  Theorem. 

4.  Describe  the  conditions  under  which  Chebyshev’s  Theorem  may  be  applied. 

5. A sample data set with a bell-shaped distribution has mean x̄ = 6 and
standard  deviation  s  =  2.  Find  the  approximate  proportion  of  observations  in 
the  data  set  that  lie: 

a.  between  4  and  8; 

b.  between  2  and  10; 

c.  between  0  and  12. 

6. A population data set with a bell-shaped distribution has mean μ = 6 and
standard deviation σ = 2. Find the approximate proportion of observations in
the  data  set  that  lie: 

a.  between  4  and  8; 

b.  between  2  and  10; 

c.  between  0  and  12. 

7. A population data set with a bell-shaped distribution has mean μ = 2 and
standard deviation σ = 1.1. Find the approximate proportion of observations in
the  data  set  that  lie: 

a.  above  2; 

b.  above  3.1; 

c.  between  2  and  3.1. 

8. A sample data set with a bell-shaped distribution has mean x̄ = 2 and
standard  deviation  s  =  1.1.  Find  the  approximate  proportion  of  observations  in 
the  data  set  that  lie: 

a.  below  -0.2; 

b.  below  3.1; 

c.  between  -1.3  and  0.9. 
9. A population data set with a bell-shaped distribution and size N = 500 has mean
μ = 2 and standard deviation σ = 1.1. Find the approximate number of
observations  in  the  data  set  that  lie: 

a.  above  2; 

b.  above  3.1; 

c.  between  2  and  3.1. 

10.  A  sample  data  set  with  a  bell-shaped  distribution  and  size  n  =  128  has  mean 
x̄ = 2 and standard deviation s = 1.1. Find the approximate number of
observations  in  the  data  set  that  lie: 

a.  below  -0.2; 

b.  below  3.1; 

c.  between  -1.3  and  0.9. 

11. A sample data set has mean x̄ = 6 and standard deviation s = 2. Find the
minimum  proportion  of  observations  in  the  data  set  that  must  lie: 

a.  between  2  and  10; 

b.  between  0  and  12; 

c.  between  4  and  8. 

12. A population data set has mean μ = 2 and standard deviation σ = 1.1. Find the
minimum  proportion  of  observations  in  the  data  set  that  must  lie: 

a.  between  -0.2  and  4.2; 

b.  between  -1.3  and  5.3. 

13. A population data set of size N = 500 has mean μ = 5.2 and standard deviation
σ = 1.1. Find the minimum number of observations in the data set that must lie:

a.  between  3  and  7.4; 

b.  between  1.9  and  8.5. 

14. A sample data set of size n = 128 has mean x̄ = 2 and standard deviation s = 2.
Find  the  minimum  number  of  observations  in  the  data  set  that  must  lie: 

a.  between  -2  and  6  (including  -2  and  6); 

b.  between  -4  and  8  (including  -4  and  8). 

15. A sample data set of size n = 30 has mean x̄ = 6 and standard deviation s = 2.

a.  What  is  the  maximum  proportion  of  observations  in  the  data  set  that  can 
lie  outside  the  interval  (2,10)? 

b.  What  can  be  said  about  the  proportion  of  observations  in  the  data  set  that 
are  below  2? 
c. What can be said about the proportion of observations in the data set that
are  above  10? 

d.  What  can  be  said  about  the  number  of  observations  in  the  data  set  that  are 
above  10? 

16. A population data set has mean μ = 2 and standard deviation σ = 1.1.

a. What is the maximum proportion of observations in the data set that can
lie outside the interval (-1.3, 5.3)?

b.  What  can  be  said  about  the  proportion  of  observations  in  the  data  set  that 
are  below  -1.3? 

c.  What  can  be  said  about  the  proportion  of  observations  in  the  data  set  that 
are  above  5.3? 


APPLICATIONS 


17.  Scores  on  a  final  exam  taken  by  1,200  students  have  a  bell-shaped  distribution 
with  mean  72  and  standard  deviation  9. 

a.  What  is  the  median  score  on  the  exam? 

b.  About  how  many  students  scored  between  63  and  81? 

c.  About  how  many  students  scored  between  72  and  90? 

d.  About  how  many  students  scored  below  54? 

18.  Lengths  of  fish  caught  by  a  commercial  fishing  boat  have  a  bell-shaped 
distribution  with  mean  23  inches  and  standard  deviation  1.5  inches. 

a.  About  what  proportion  of  all  fish  caught  are  between  20  inches  and  26 
inches  long? 

b.  About  what  proportion  of  all  fish  caught  are  between  20  inches  and  23 
inches  long? 

c.  About  how  long  is  the  longest  fish  caught  (only  a  small  fraction  of  a 
percent  are  longer)? 

19. Hockey pucks used in professional hockey games must weigh between 5.5 and
6 ounces. If the weight of pucks manufactured by a particular process is
bell-shaped, with mean 5.75 ounces and standard deviation 0.125 ounce, what
proportion of the pucks will be usable in professional games?

20. Hockey pucks used in professional hockey games must weigh between 5.5 and
6 ounces. If the weight of pucks manufactured by a particular process is
bell-shaped and has mean 5.75 ounces, how large can the standard deviation be if
99.7% of the pucks are to be usable in professional games?
21. Speeds of vehicles on a section of highway have a bell-shaped distribution with
mean 60 mph and standard deviation 2.5 mph.

a. If the speed limit is 55 mph, about what proportion of vehicles are
speeding? 

b.  What  is  the  median  speed  for  vehicles  on  this  highway? 

c.  What  is  the  percentile  rank  of  the  speed  65  mph? 

d.  What  speed  corresponds  to  the  16th  percentile? 

22.  Suppose  that,  as  in  the  previous  exercise,  speeds  of  vehicles  on  a  section  of 
highway  have  mean  60  mph  and  standard  deviation  2.5  mph,  but  now  the 
distribution  of  speeds  is  unknown. 

a. If the speed limit is 55 mph, at least what proportion of vehicles must be
speeding?

b.  What  can  be  said  about  the  proportion  of  vehicles  going  65  mph  or  faster? 

23.  An  instructor  announces  to  the  class  that  the  scores  on  a  recent  exam  had  a 
bell-shaped  distribution  with  mean  75  and  standard  deviation  5. 

a.  What  is  the  median  score? 

b.  Approximately  what  proportion  of  students  in  the  class  scored  between  70 
and  80? 

c.  Approximately  what  proportion  of  students  in  the  class  scored  above  85? 

d.  What  is  the  percentile  rank  of  the  score  85? 

24. The GPAs of all currently registered students at a large university have a
bell-shaped distribution with mean 2.7 and standard deviation 0.6. Students with a
GPA  below  1.5  are  placed  on  academic  probation.  Approximately  what 
percentage  of  currently  registered  students  at  the  university  are  on  academic 
probation? 

25.  Thirty-six  students  took  an  exam  on  which  the  average  was  80  and  the 
standard  deviation  was  6.  A  rumor  says  that  five  students  had  scores  61  or 
below.  Can  the  rumor  be  true?  Why  or  why  not? 


ADDITIONAL  EXERCISES 


26. For the sample data

x    26   27   28   29   30   31   32
f     3    4   16   12    6    2    1

Σx = 1,256 and Σx² = 35,926.
a. Compute the mean and the standard deviation.

b. About how many of the measurements does the Empirical Rule predict will
be in the interval (x̄ - s, x̄ + s), the interval (x̄ - 2s, x̄ + 2s), and the
interval (x̄ - 3s, x̄ + 3s)?

c. Compute the number of measurements that are actually in each of the
intervals listed in part (b), and compare to the predicted numbers.

27. A sample of size n = 80 has mean 139 and standard deviation 13, but nothing
else is known about it.

a.  What  can  be  said  about  the  number  of  observations  that  lie  in  the  interval 
(126,152)? 

b.  What  can  be  said  about  the  number  of  observations  that  lie  in  the  interval 
(113,165)? 

c.  What  can  be  said  about  the  number  of  observations  that  exceed  165? 

d.  What  can  be  said  about  the  number  of  observations  that  either  exceed  165 
or  are  less  than  113? 

28. For the sample data

x     1    2    3    4    5
f    84   29    3    3    1

Σx = 168 and Σx² = 300.

a.  Compute  the  sample  mean  and  the  sample  standard  deviation. 

b.  Considering  the  shape  of  the  data  set,  do  you  expect  the  Empirical  Rule  to 
apply?  Count  the  number  of  measurements  within  one  standard  deviation 
of  the  mean  and  compare  it  to  the  number  predicted  by  the  Empirical  Rule. 

c.  What  does  Chebyshev’s  Rule  say  about  the  number  of  measurements 
within  one  standard  deviation  of  the  mean? 

d.  Count  the  number  of  measurements  within  two  standard  deviations  of  the 
mean  and  compare  it  to  the  minimum  number  guaranteed  by  Chebyshev’s 
Theorem  to  lie  in  that  interval. 

29. For the sample data set

x    47   48   49   50   51
f     1    3   18    2    1

Σx = 1224 and Σx² = 59,940.

a.  Compute  the  sample  mean  and  the  sample  standard  deviation. 

b.  Considering  the  shape  of  the  data  set,  do  you  expect  the  Empirical  Rule  to 
apply?  Count  the  number  of  measurements  within  one  standard  deviation 
of the mean and compare it to the number predicted by the Empirical Rule.

c. What does Chebyshev’s Rule say about the number of measurements
within  one  standard  deviation  of  the  mean? 

d.  Count  the  number  of  measurements  within  two  standard  deviations  of  the 
mean  and  compare  it  to  the  minimum  number  guaranteed  by  Chebyshev’s 
Theorem  to  lie  in  that  interval. 
1. See the displayed

3. See the displayed

5. a. 0.68.
   b. 0.95.
   c. 0.997.

7. a. 0.5.
   b. 0.16.
   c. 0.34.

9. a. 250.
   b. 80.
   c. 170.

11. a. 3/4.
    b. 8/9.
    c. 0.

13. a. 375.
    b. 445.

15. a. At most 0.25.
    b. At most 0.25.
    c. At most 0.25.
    d. At most 7.

17. a. 72.
    b. 816.
    c. 570.
    d. 30.

19. 0.95.

21. a. 0.975.
    b. 60.
    c. 97.5.
    d. 57.5.

23. a. 75.
    b. 0.68.
    c. 0.025.
    d. 0.975.
25. By Chebyshev’s Theorem at most 1/9 of the scores can be below 62, so the
    rumor is impossible.

27. a. Nothing.
    b. It is at least 60.
    c. It is at most 20.
    d. It is at most 20.

29. a. x̄ = 48.96, s = 0.7348.
    b. Roughly bell-shaped, the Empirical Rule should apply. True count: 18,
       predicted: 17.
    c. Nothing.
    d. True count: 23, guaranteed: at least 18.75, hence at least 19.
Suppose a polling organization questions 1,200 voters in order to estimate the
proportion of all voters who favor a particular bond issue. We would expect the
proportion of the 1,200 voters in the survey who are in favor to be close to the
proportion of all voters who are in favor, but this need not be true. There is a
degree of randomness associated with the survey result. If the survey result is
highly likely to be close to the true proportion, then we have confidence in the
survey result. If it is not particularly likely to be close to the population proportion,
then we would perhaps not take the survey result too seriously. The likelihood that
the  survey  proportion  is  close  to  the  population  proportion  determines  our 
confidence  in  the  survey  result.  For  that  reason,  we  would  like  to  be  able  to 
compute  that  likelihood.  The  task  of  computing  it  belongs  to  the  realm  of 
probability,  which  we  study  in  this  chapter. 
3.1 Sample Spaces, Events, and Their Probabilities


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  the  sample  space  associated  with  a  random 
experiment. 

2.  To  learn  the  concept  of  an  event  associated  with  a  random  experiment. 

3.  To  learn  the  concept  of  the  probability  of  an  event. 


Sample  Spaces  and  Events 

Rolling  an  ordinary  six-sided  die  is  a  familiar  example  of  a  random  experiment,  an 
action  for  which  all  possible  outcomes  can  be  listed,  but  for  which  the  actual 
outcome  on  any  given  trial  of  the  experiment  cannot  be  predicted  with  certainty.  In 
such  a  situation  we  wish  to  assign  to  each  outcome,  such  as  rolling  a  two,  a  number, 
called  the  probability  of  the  outcome,  that  indicates  how  likely  it  is  that  the  outcome 
will  occur.  Similarly,  we  would  like  to  assign  a  probability  to  any  event,  or  collection 
of  outcomes,  such  as  rolling  an  even  number,  which  indicates  how  likely  it  is  that 
the  event  will  occur  if  the  experiment  is  performed.  This  section  provides  a 
framework  for  discussing  probability  problems,  using  the  terms  just  mentioned. 


1. Sample space: the set of all possible outcomes of a random experiment.

2. Event: any set of outcomes.
EXAMPLE 1


Construct  a  sample  space  for  the  experiment  that  consists  of  tossing  a  single 
coin. 

Solution: 

The  outcomes  could  be  labeled  h  for  heads  and  t  for  tails.  Then  the  sample 
space  is  the  set  S  =  {h,t}. 


EXAMPLE  2 


Construct  a  sample  space  for  the  experiment  that  consists  of  rolling  a  single 
die.  Find  the  events  that  correspond  to  the  phrases  “an  even  number  is 
rolled”  and  “a  number  greater  than  two  is  rolled.” 

Solution: 

The  outcomes  could  be  labeled  according  to  the  number  of  dots  on  the  top 
face of the die. Then the sample space is the set S = {1, 2, 3, 4, 5, 6}.

The outcomes that are even are 2, 4, and 6, so the event that corresponds to
the phrase “an even number is rolled” is the set {2, 4, 6}, which it is natural to
denote by the letter E. We write E = {2, 4, 6}.

Similarly the event that corresponds to the phrase “a number greater than
two is rolled” is the set T = {3, 4, 5, 6}, which we have denoted T.
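
The sample space and the two events translate directly into sets. The sketch below is illustrative only; the variable names simply mirror the letters S, E, and T used in the example.

```python
# Sample space for one roll of a die, with outcomes labeled by the number of dots.
S = set(range(1, 7))

E = {x for x in S if x % 2 == 0}  # "an even number is rolled"
T = {x for x in S if x > 2}       # "a number greater than two is rolled"

print(sorted(E))  # [2, 4, 6]
print(sorted(T))  # [3, 4, 5, 6]

# Every event is a subset of the sample space.
print(E <= S and T <= S)  # True
```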


A  graphical  representation  of  a  sample  space  and  events  is  a  Venn  diagram,  as 
shown in Figure 3.1 "Venn Diagrams for Two Sample Spaces" for Note 3.6 "Example
1" and Note 3.7 "Example 2". In general the sample space S is represented by a
rectangle,  outcomes  by  points  within  the  rectangle,  and  events  by  ovals  that 
enclose  the  outcomes  that  compose  them. 
Figure 3.1 Venn Diagrams for Two Sample Spaces


EXAMPLE  3 


A  random  experiment  consists  of  tossing  two  coins. 

a.  Construct  a  sample  space  for  the  situation  that  the  coins  are 
indistinguishable,  such  as  two  brand  new  pennies. 

b.  Construct  a  sample  space  for  the  situation  that  the  coins  are 
distinguishable,  such  as  one  a  penny  and  the  other  a  nickel. 

Solution: 

a.  After  the  coins  are  tossed  one  sees  either  two  heads,  which  could  be 
labeled 2h, two tails, which could be labeled 2t, or coins that differ,
which could be labeled d. Thus a sample space is S = {2h, 2t, d}.

b.  Since  we  can  tell  the  coins  apart,  there  are  now  two  ways  for  the  coins  to 
differ:  the  penny  heads  and  the  nickel  tails,  or  the  penny  tails  and  the 
nickel  heads.  We  can  label  each  outcome  as  a  pair  of  letters,  the  first  of 
which  indicates  how  the  penny  landed  and  the  second  of  which 
indicates  how  the  nickel  landed.  A  sample  space  is  then 

S'  =  {hh,ht,th,tt}. 
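
The relationship between the two sample spaces can be made concrete in code. This is a minimal sketch, assuming the labels used in the example; the helper `collapse` is a hypothetical name introduced here to show how each ordered outcome maps to its label when the coins cannot be told apart.

```python
from itertools import product

# Distinguishable coins: the first letter is the penny, the second is the nickel.
S_prime = ["".join(pair) for pair in product("ht", repeat=2)]
print(S_prime)  # ['hh', 'ht', 'th', 'tt']


def collapse(outcome):
    """Label an ordered outcome as it appears when the coins look identical."""
    if outcome == "hh":
        return "2h"
    if outcome == "tt":
        return "2t"
    return "d"  # the coins differ


S = sorted({collapse(o) for o in S_prime})
print(S)  # ['2h', '2t', 'd']
```

Two different outcomes of S', ht and th, collapse to the single outcome d of S.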


A  device  that  can  be  helpful  in  identifying  all  possible  outcomes  of  a  random 
experiment,  particularly  one  that  can  be  viewed  as  proceeding  in  stages,  is  what  is 
called  a  tree  diagram.  It  is  described  in  the  following  example. 
EXAMPLE 4


Construct  a  sample  space  that  describes  all  three-child  families  according  to 
the  genders  of  the  children  with  respect  to  birth  order. 

Solution: 

Two  of  the  outcomes  are  “two  boys  then  a  girl,”  which  we  might  denote 
bbg, and “a girl then two boys,” which we would denote gbb. Clearly
there are many outcomes, and when we try to list all of them it could be
difficult to be sure that we have found them all unless we proceed
systematically. The tree diagram shown in Figure 3.2 "Tree Diagram For
Three-Child Families" gives a systematic approach.

Figure 3.2 Tree Diagram For Three-Child Families

[Figure: a three-stage tree in which every branch ends in b or g; the final
nodes, from top to bottom, are bbb, bbg, bgb, bgg, gbb, gbg, ggb, ggg.]


The  diagram  was  constructed  as  follows.  There  are  two  possibilities  for  the 
first  child,  boy  or  girl,  so  we  draw  two  line  segments  coming  out  of  a  starting 
point,  one  ending  in  a  b  for  “boy”  and  the  other  ending  in  a  g  for  “girl.”  For 
each  of  these  two  possibilities  for  the  first  child  there  are  two  possibilities 
for  the  second  child,  “boy”  or  “girl,”  so  from  each  of  the  b  and  g  we  draw 
two  line  segments,  one  segment  ending  in  a  b  and  one  in  a  g.  For  each  of  the 
four  ending  points  now  in  the  diagram  there  are  two  possibilities  for  the 
third  child,  so  we  repeat  the  process  once  more. 

The  line  segments  are  called  branches  of  the  tree.  The  right  ending  point  of 
each  branch  is  called  a  node.  The  nodes  on  the  extreme  right  are  the  final 
nodes;  to  each  one  there  corresponds  an  outcome,  as  shown  in  the  figure. 
From the tree it is easy to read off the eight outcomes of the experiment, so
the sample space is, reading from the top to the bottom of the final nodes in
the tree,

S = {bbb, bbg, bgb, bgg, gbb, gbg, ggb, ggg}.
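
The stage-by-stage construction of the tree can be imitated in code. The following is a sketch only; the function name `grow` is hypothetical. Each pass through the loop plays the role of one stage of the tree, extending every node reached so far by one branch per possible label.

```python
def grow(stages, labels="bg"):
    """Extend every partial outcome by each label, one stage at a time."""
    nodes = [""]  # the starting point of the tree
    for _ in range(stages):
        nodes = [node + label for node in nodes for label in labels]
    return nodes  # the final nodes, in top-to-bottom order


S = grow(3)
print(S)       # ['bbb', 'bbg', 'bgb', 'bgg', 'gbb', 'gbg', 'ggb', 'ggg']
print(len(S))  # 8
```

Because the number of nodes doubles at every stage, three stages give 2 × 2 × 2 = 8 outcomes.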


Probability 


The  following  formula  expresses  the  content  of  the  definition  of  the  probability  of 
an  event: 


3. Probability of an outcome: a number that measures the likelihood of the
outcome.

4. Probability of an event: a number that measures the likelihood of the
event.


If an event E is E = {e₁, e₂, ..., eₖ}, then

P(E) = P(e₁) + P(e₂) + ... + P(eₖ)

Figure  3.3  "Sample  Spaces  and  Probability"  graphically  illustrates  the  definitions. 
Figure 3.3 Sample Spaces and Probability

[Figure: a sample space whose outcomes e₁, e₂, e₃, e₄ have probabilities
p₁, p₂, p₃, p₄, with events A = {e₁, e₂} and B = {e₂, e₃, e₄}, so that
P(A) = p₁ + p₂ and P(B) = p₂ + p₃ + p₄.]


Since  the  whole  sample  space  S  is  an  event  that  is  certain  to  occur,  the  sum  of  the 
probabilities  of  all  the  outcomes  must  be  the  number  1. 


In  ordinary  language  probabilities  are  frequently  expressed  as  percentages.  For 
example,  we  would  say  that  there  is  a  70%  chance  of  rain  tomorrow,  meaning  that 
the  probability  of  rain  is  0.70.  We  will  use  this  practice  here,  but  in  all  the 
computational  formulas  that  follow  we  will  use  the  form  0.70  and  not  70%. 


EXAMPLE  5 


A  coin  is  called  “balanced”  or  “fair”  if  each  side  is  equally  likely  to  land  up. 
Assign  a  probability  to  each  outcome  in  the  sample  space  for  the  experiment 
that  consists  of  tossing  a  single  fair  coin. 

Solution: 

With  the  outcomes  labeled  h  for  heads  and  t  for  tails,  the  sample  space  is  the 
set S = {h, t}. Since the outcomes have the same probabilities, which
must add up to 1, each outcome is assigned probability 1/2.
EXAMPLE 6


A  die  is  called  “balanced”  or  “fair”  if  each  side  is  equally  likely  to  land  on 
top.  Assign  a  probability  to  each  outcome  in  the  sample  space  for  the 
experiment  that  consists  of  tossing  a  single  fair  die.  Find  the  probabilities  of 
the  events  E:  “an  even  number  is  rolled”  and  T :  “a  number  greater  than  two 
is  rolled.” 

Solution: 

With  outcomes  labeled  according  to  the  number  of  dots  on  the  top  face  of 
the die, the sample space is the set S = {1, 2, 3, 4, 5, 6}. Since there are
six equally likely outcomes, whose probabilities must add up to 1, each is
assigned probability 1/6.

Since E = {2, 4, 6},

P(E) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2.

Since T = {3, 4, 5, 6},

P(T) = 4/6 = 2/3.
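
The same sums can be carried out with exact fractions. This is a minimal sketch; the dictionary `p` and the helper `prob` are hypothetical names used only for illustration. The helper applies the definition directly: the probability of an event is the sum of the probabilities of its outcomes.

```python
from fractions import Fraction

S = range(1, 7)
# Fair die: each of the six outcomes is assigned probability 1/6.
p = {outcome: Fraction(1, 6) for outcome in S}


def prob(event):
    """P(event) = sum of the probabilities of the outcomes in the event."""
    return sum(p[outcome] for outcome in event)


E = {2, 4, 6}
T = {3, 4, 5, 6}
print(prob(E))  # 1/2
print(prob(T))  # 2/3
print(prob(S))  # 1
```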
EXAMPLE 7


Two fair coins are tossed. Find the probability that the coins match, i.e.,
either both land heads or both land tails.

Solution:

In Note 3.8 "Example 3" we constructed the sample space
S = {2h, 2t, d} for the situation in which the coins are identical and the
sample space S' = {hh, ht, th, tt} for the situation in which the two
coins can be told apart.

The theory of probability does not tell us how to assign probabilities to the
outcomes, only what to do with them once they are assigned. Specifically,
using sample space S, matching coins is the event M = {2h, 2t}, which
has probability P(2h) + P(2t). Using sample space S', matching coins
is the event M' = {hh, tt}, which has probability P(hh) + P(tt). In
the physical world it should make no difference whether the coins are
identical or not, and so we would like to assign probabilities to the outcomes
so that the numbers P(M) and P(M') are the same and best match what
we observe when actual physical experiments are performed with coins that
seem to be fair. Actual experience suggests that the outcomes in S' are
equally likely, so we assign to each probability 1/4, and then

P(M') = P(hh) + P(tt) = 1/4 + 1/4 = 1/2.

Similarly, from experience appropriate choices for the outcomes in S are:

P(2h) = 1/4, P(2t) = 1/4, P(d) = 1/2,

which give the same final answer

P(M) = P(2h) + P(2t) = 1/4 + 1/4 = 1/2.
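
The agreement between the two sample spaces can be verified numerically. The sketch below is illustrative and rests on one assumption taken from the example: the four outcomes for distinguishable coins are equally likely. The probabilities for identical coins are then derived by pooling, rather than assigned separately; the variable names are hypothetical.

```python
from fractions import Fraction

# Distinguishable coins: four outcomes, assumed equally likely.
p_prime = {o: Fraction(1, 4) for o in ("hh", "ht", "th", "tt")}
p_match_prime = p_prime["hh"] + p_prime["tt"]

# Identical coins: "d" pools ht and th, so it is twice as likely as 2h or 2t.
p = {
    "2h": p_prime["hh"],
    "2t": p_prime["tt"],
    "d": p_prime["ht"] + p_prime["th"],
}
p_match = p["2h"] + p["2t"]

print(p_match_prime, p_match)  # 1/2 1/2
print(sum(p.values()))         # 1
```

The three outcomes for identical coins are not equally likely, so counting outcomes alone would give the wrong answer 2/3 for the probability of a match.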


The previous three examples illustrate how probabilities can be computed simply
by counting when the sample space consists of a finite number of equally likely
outcomes. In some situations the individual outcomes of any sample space that
represents the experiment are unavoidably unequally likely, in which case
probabilities cannot be computed merely by counting, but the computational
formula given in the definition of the probability of an event must be used.


EXAMPLE  8 


The  breakdown  of  the  student  body  in  a  local  high  school  according  to  race 
and  ethnicity  is  51%  white,  27%  black,  11%  Hispanic,  6%  Asian,  and  5%  for  all 
others.  A  student  is  randomly  selected  from  this  high  school.  (To  select 
“randomly”  means  that  every  student  has  the  same  chance  of  being 
selected.)  Find  the  probabilities  of  the  following  events: 

a.  B:  the  student  is  black, 

b.  M:  the  student  is  minority  (that  is,  not  white), 

c.  N:  the  student  is  not  black. 

Solution: 

The  experiment  is  the  action  of  randomly  selecting  a  student  from  the 
student  population  of  the  high  school.  An  obvious  sample  space  is 
S = {w, b, h, a, o}. Since 51% of the students are white and all
students have the same chance of being selected, P(w) = 0.51, and
similarly for the other outcomes. This information is summarized in the
following table:

Outcome       w     b     h     a     o
Probability   0.51  0.27  0.11  0.06  0.05

a. Since B = {b}, P(B) = P(b) = 0.27.

b. Since M = {b, h, a, o},

P(M) = P(b) + P(h) + P(a) + P(o) = 0.27 + 0.11 + 0.06 + 0.05 = 0.49.

c. Since N = {w, h, a, o},

P(N) = P(w) + P(h) + P(a) + P(o) = 0.51 + 0.11 + 0.06 + 0.05 = 0.73.
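
When outcomes are not equally likely, the computation is a lookup followed by a sum. The following is a sketch using the percentages given in the example; the names `p` and `prob` are hypothetical, and exact fractions are used to avoid floating-point rounding in the sums.

```python
from fractions import Fraction

# Percentages from the example, stored as exact fractions.
percent = {"w": 51, "b": 27, "h": 11, "a": 6, "o": 5}
p = {outcome: Fraction(value, 100) for outcome, value in percent.items()}
assert sum(p.values()) == 1  # the probabilities of all outcomes add up to 1


def prob(event):
    """P(event) = sum of the probabilities of the outcomes in the event."""
    return sum(p[outcome] for outcome in event)


B = {"b"}
M = set(p) - {"w"}  # minority: every outcome except white
N = set(p) - {"b"}  # not black

print(float(prob(B)), float(prob(M)), float(prob(N)))  # 0.27 0.49 0.73
```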
EXAMPLE 9


The  student  body  in  the  high  school  considered  in  Note  3.18  "Example  8" 
may  be  broken  down  into  ten  categories  as  follows:  25%  white  male,  26% 
white  female,  12%  black  male,  15%  black  female,  6%  Hispanic  male,  5% 
Hispanic  female,  3%  Asian  male,  3%  Asian  female,  1%  male  of  other 
minorities  combined,  and  4%  female  of  other  minorities  combined.  A 
student  is  randomly  selected  from  this  high  school.  Find  the  probabilities  of 
the  following  events: 

a.  B:  the  student  is  black, 

b.  MF:  the  student  is  minority  female, 

c.  FN:  the  student  is  female  and  is  not  black. 

Solution: 

Now  the  sample  space  is 

S = {wm, bm, hm, am, om, wf, bf, hf, af, of}. The
information  given  in  the  example  can  be  summarized  in  the  following  table, 
called  a  two-way  contingency  table: 


                    Race / Ethnicity
Gender    White   Black   Hispanic   Asian   Others
Male      0.25    0.12    0.06       0.03    0.01
Female    0.26    0.15    0.05       0.03    0.04

a. Since B = {bm, bf},

P(B) = P(bm) + P(bf) = 0.12 + 0.15 = 0.27.

b. Since MF = {bf, hf, af, of},

P(MF) = P(bf) + P(hf) + P(af) + P(of) = 0.15 + 0.05 + 0.03 + 0.04 = 0.27.

c. Since FN = {wf, hf, af, of},

P(FN) = P(wf) + P(hf) + P(af) + P(of) = 0.26 + 0.05 + 0.03 + 0.04 = 0.38.
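
A two-way contingency table maps naturally onto a nested dictionary. The sketch below uses the percentages stated in the example; the structure `table`, the two-letter outcome labels built from it, and the helper `prob` are illustrative assumptions. Events are described by a condition on the outcome label rather than by listing outcomes one by one.

```python
from fractions import Fraction

# Rows are genders, columns are race/ethnicity; entries are percentages.
table = {
    "m": {"w": 25, "b": 12, "h": 6, "a": 3, "o": 1},
    "f": {"w": 26, "b": 15, "h": 5, "a": 3, "o": 4},
}

# Outcome labels such as "bf": first letter race/ethnicity, second letter gender.
p = {
    race + gender: Fraction(pct, 100)
    for gender, row in table.items()
    for race, pct in row.items()
}


def prob(condition):
    """Sum the probabilities of all outcomes that satisfy the condition."""
    return sum(q for outcome, q in p.items() if condition(outcome))


p_B = prob(lambda o: o[0] == "b")                   # black
p_MF = prob(lambda o: o[1] == "f" and o[0] != "w")  # minority female
p_FN = prob(lambda o: o[1] == "f" and o[0] != "b")  # female and not black

print(float(p_B), float(p_MF), float(p_FN))  # 0.27 0.27 0.38
```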
KEY TAKEAWAYS


•  The  sample  space  of  a  random  experiment  is  the  collection  of  all 
possible  outcomes. 

•  An  event  associated  with  a  random  experiment  is  a  subset  of  the  sample 
space. 

•  The  probability  of  any  outcome  is  a  number  between  0  and  1.  The 
probabilities  of  all  the  outcomes  add  up  to  1. 

•  The  probability  of  any  event  A  is  the  sum  of  the  probabilities  of  the 
outcomes  in  A. 
1. A box contains 10 white and 10 black marbles. Construct a sample space for the
experiment  of  randomly  drawing  out,  with  replacement,  two  marbles  in 
succession  and  noting  the  color  each  time.  (To  draw  “with  replacement”  means 
that  the  first  marble  is  put  back  before  the  second  marble  is  drawn.) 

2.  A  box  contains  16  white  and  16  black  marbles.  Construct  a  sample  space  for  the 
experiment  of  randomly  drawing  out,  with  replacement,  three  marbles  in 
succession  and  noting  the  color  each  time.  (To  draw  “with  replacement”  means 
that  each  marble  is  put  back  before  the  next  marble  is  drawn.) 

3.  A  box  contains  8  red,  8  yellow,  and  8  green  marbles.  Construct  a  sample  space 
for  the  experiment  of  randomly  drawing  out,  with  replacement,  two  marbles  in 
succession  and  noting  the  color  each  time. 

4.  A  box  contains  6  red,  6  yellow,  and  6  green  marbles.  Construct  a  sample  space 
for  the  experiment  of  randomly  drawing  out,  with  replacement,  three  marbles 
in  succession  and  noting  the  color  each  time. 

5.  In  the  situation  of  Exercise  1,  list  the  outcomes  that  comprise  each  of  the 
following  events. 

a.  At  least  one  marble  of  each  color  is  drawn. 

b.  No  white  marble  is  drawn. 

6.  In  the  situation  of  Exercise  2,  list  the  outcomes  that  comprise  each  of  the 
following  events. 

a.  At  least  one  marble  of  each  color  is  drawn. 

b.  No  white  marble  is  drawn. 

c.  More  black  than  white  marbles  are  drawn. 

7.  In  the  situation  of  Exercise  3,  list  the  outcomes  that  comprise  each  of  the 
following  events. 

a.  No  yellow  marble  is  drawn. 

b.  The  two  marbles  drawn  have  the  same  color. 

c.  At  least  one  marble  of  each  color  is  drawn. 

8.  In  the  situation  of  Exercise  4,  list  the  outcomes  that  comprise  each  of  the 
following  events. 

a.  No  yellow  marble  is  drawn. 
b.  The  three  marbles  drawn  have  the  same  color. 

c.  At  least  one  marble  of  each  color  is  drawn. 

9.  Assuming  that  each  outcome  is  equally  likely,  find  the  probability  of  each 
event  in  Exercise  5. 

10.  Assuming  that  each  outcome  is  equally  likely,  find  the  probability  of  each 
event  in  Exercise  6. 

11.  Assuming  that  each  outcome  is  equally  likely,  find  the  probability  of  each 
event  in  Exercise  7. 

12.  Assuming  that  each  outcome  is  equally  likely,  find  the  probability  of  each 
event  in  Exercise  8. 

13. A sample space is S = {a, b, c, d, e}. Identify two events as
U = {a, b, d} and V = {b, c, d}. Suppose P(a) and P(b) are each
0.2 and P(c) and P(d) are each 0.1.

a. Determine what P(e) must be.

b. Find P(U).

c. Find P(V).

14. A sample space is S = {u, v, w, x}. Identify two events as A = {v, w}
and B = {u, w, x}. Suppose P(u) = 0.22, P(w) = 0.36, and
P(x) = 0.27.

a. Determine what P(v) must be.

b. Find P(A).

c. Find P(B).

15. A sample space is S = {m, n, q, r, s}. Identify two events as
U = {m, q, s} and V = {n, q, r}. The probabilities of some of the
outcomes are given by the following table:

Outcome       m      n      q      r      s
Probability   0.18   0.16          0.24   0.21

a. Determine what P(q) must be.

b. Find P(U).

c. Find P(V).

16. A sample space is S = {d, e, f, g, h}. Identify two events as
M = {e, f, g, h} and N = {d, g}. The probabilities of some of the
outcomes are given by the following table:

Outcome       d      e      f      g      h
Probability   0.22   0.13   0.27          0.19

a. Determine what P(g) must be.

b. Find P(M).

c. Find P(N).


APPLICATIONS 


17.  The  sample  space  that  describes  all  three-child  families  according  to  the 
genders  of  the  children  with  respect  to  birth  order  was  constructed  in  Note  3.9 
"Example  4".  Identify  the  outcomes  that  comprise  each  of  the  following  events 
in  the  experiment  of  selecting  a  three-child  family  at  random. 

a.  At  least  one  child  is  a  girl. 

b.  At  most  one  child  is  a  girl. 

c.  All  of  the  children  are  girls. 

d.  Exactly  two  of  the  children  are  girls. 

e.  The  first  born  is  a  girl. 

18.  The  sample  space  that  describes  three  tosses  of  a  coin  is  the  same  as  the  one 
constructed  in  Note  3.9  "Example  4"  with  “boy”  replaced  by  “heads”  and  “girl” 
replaced  by  “tails.”  Identify  the  outcomes  that  comprise  each  of  the  following 
events  in  the  experiment  of  tossing  a  coin  three  times. 

a.  The  coin  lands  heads  more  often  than  tails. 

b.  The  coin  lands  heads  the  same  number  of  times  as  it  lands  tails. 

c.  The  coin  lands  heads  at  least  twice. 

d.  The  coin  lands  heads  on  the  last  toss. 

19.  Assuming  that  the  outcomes  are  equally  likely,  find  the  probability  of  each 
event  in  Exercise  17. 

20.  Assuming  that  the  outcomes  are  equally  likely,  find  the  probability  of  each 
event  in  Exercise  18. 


ADDITIONAL  EXERCISES 


21.  The  following  two-way  contingency  table  gives  the  breakdown  of  the 
population  in  a  particular  locale  according  to  age  and  tobacco  usage: 
               Tobacco Use
Age            Smoker   Non-smoker
Under 30       0.05     0.20
Over 30        0.20     0.55

A  person  is  selected  at  random.  Find  the  probability  of  each  of  the  following 
events. 

a.  The  person  is  a  smoker. 

b.  The  person  is  under  30. 

c.  The  person  is  a  smoker  who  is  under  30. 

22. The following two-way contingency table gives the breakdown of the
population in a particular locale according to party affiliation (A, B, C, or None)
and opinion on a bond issue:

               Opinion
Affiliation    Favors   Opposes   Undecided
A              0.12     0.09      0.07
B              0.16     0.12      0.14
C              0.04     0.03      0.06
None           0.08     0.06      0.03

A  person  is  selected  at  random.  Find  the  probability  of  each  of  the  following 
events. 

a.  The  person  is  affiliated  with  party  B. 

b.  The  person  is  affiliated  with  some  party. 

c.  The  person  is  in  favor  of  the  bond  issue. 

d.  The  person  has  no  party  affiliation  and  is  undecided  about  the  bond  issue. 

23. The following two-way contingency table gives the breakdown of the
population of married or previously married women beyond child-bearing age
in a particular locale according to age at first marriage and number of
children:

               Number of Children
Age            0        1 or 2   3 or More
Under 20       0.02     0.14     0.08
20-29          0.07     0.37     0.11
30 and above   0.10     0.10     0.01

A  woman  is  selected  at  random.  Find  the  probability  of  each  of  the  following 
events. 

a.  The  woman  was  in  her  twenties  at  her  first  marriage. 

b.  The  woman  was  20  or  older  at  her  first  marriage. 

c.  The  woman  had  no  children. 

d.  The  woman  was  in  her  twenties  at  her  first  marriage  and  had  at  least  three 
children. 

24.  The  following  two-way  contingency  table  gives  the  breakdown  of  the 
population  of  adults  in  a  particular  locale  according  to  highest  level  of 
education  and  whether  or  not  the  individual  regularly  takes  dietary 
supplements: 


                          Use of Supplements
Education                 Takes    Does Not Take
No High School Diploma    0.04     0.06
High School Diploma       0.06     0.44
Undergraduate Degree      0.09     0.28
Graduate Degree           0.01     0.02

An  adult  is  selected  at  random.  Find  the  probability  of  each  of  the  following 
events. 

a.  The  person  has  a  high  school  diploma  and  takes  dietary  supplements 
regularly. 

b.  The  person  has  an  undergraduate  degree  and  takes  dietary  supplements 
regularly. 

c.  The  person  takes  dietary  supplements  regularly. 

d.  The  person  does  not  take  dietary  supplements  regularly. 


LARGE  DATA  SET  EXERCISES 


25.  Large  Data  Sets  4  and  4A  record  the  results  of  500  tosses  of  a  coin.  Find  the 
relative  frequency  of  each  outcome  1,  2,  3,  4,  5,  and  6.  Does  the  coin  appear  to 
be  “balanced”  or  “fair”? 
http://www.gone.2012books.lardbucket.org/sites/all/files/data4.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data4A.xls

26. Large Data Sets 6, 6A, and 6B record results of a random survey of 200 voters in
each of two regions, in which they were asked to express whether they prefer
Candidate A for a U.S. Senate seat or prefer some other candidate.

a.  Find  the  probability  that  a  randomly  selected  voter  among  these  400 
prefers  Candidate  A. 

b.  Find  the  probability  that  a  randomly  selected  voter  among  the  200  who 
live  in  Region  1  prefers  Candidate  A  (separately  recorded  in  Large  Data  Set 
6A). 

c.  Find  the  probability  that  a  randomly  selected  voter  among  the  200  who 
live  in  Region  2  prefers  Candidate  A  (separately  recorded  in  Large  Data  Set 
6B). 

http://www.gone.2012books.lardbucket.org/sites/all/files/data6.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data6A.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data6B.xls 
ANSWERS 


1. S = {bb, bw, wb, ww}

3.  S  =  {rr,  ry,  rg,  yr,  yy,  yg,  gr,  gy,  gg } 

5.  a.  {bw,wb} 

b.  {bb} 

7.  a.  { rr,rg,gr,gg } 

b.  { rr,yy,gg } 

c. ∅

9.  a.  2/4 

b.  1/4 

11.  a.  4/9 

b.  3/9 

c.  0 

13.  a.  0.4 

b.  0.5 

c.  0.4 

15.  a.  0.21 

b.  0.6 

c.  0.61 

17. a. {bbg, bgb, bgg, gbb, gbg, ggb, ggg}

b. {bbb, bbg, bgb, gbb}

c. {ggg}

d. {bgg, gbg, ggb}

e. {gbb, gbg, ggb, ggg}

19.  a.  7/8 

b.  4/8 

c.  1/8 

d.  3/8 

e.  4/8 

21.  a.  0.25 

b.  0.25 

c.  0.05 
23.  a.  0.55 

b.  0.76 

c.  0.19 

d.  0.11 


25.  The  relative  frequencies  for  1  through  6  are  0.16,  0.194,  0.162,  0.164,  0.154  and 
0.166.  It  would  appear  that  the  die  is  not  balanced. 
3.2  Complements,  Intersections,  and  Unions 


LEARNING  OBJECTIVES 

1.  To  learn  how  some  events  are  naturally  expressible  in  terms  of  other 
events. 

2.  To  learn  how  to  use  special  formulas  for  the  probability  of  an  event  that 
is  expressed  in  terms  of  one  or  more  other  events. 


Some  events  can  be  naturally  expressed  in  terms  of  other,  sometimes  simpler, 
events. 

Complements 


Definition 

The  complement  of  an  event⁵  A  in  a  sample  space  S,  denoted  Ac,  is  the  collection  of 
all  outcomes  in  S  that  are  not  elements  of  the  set  A.  It  corresponds  to  negating  any 
description  in  words  of  the  event  A. 


5.  The  event  does  not  occur. 
EXAMPLE  10 


Two  events  connected  with  the  experiment  of  rolling  a  single  die  are  E:  “the 
number  rolled  is  even”  and  T:  “the  number  rolled  is  greater  than  two.”  Find 
the  complement  of  each. 

Solution: 

In the sample space S = {1, 2, 3, 4, 5, 6} the corresponding sets of
outcomes are E = {2, 4, 6} and T = {3, 4, 5, 6}. The complements
are Ec = {1, 3, 5} and Tc = {1, 2}.

In  words  the  complements  are  described  by  “the  number  rolled  is  not  even” 
and  “the  number  rolled  is  not  greater  than  two.”  Of  course  easier 
descriptions  would  be  “the  number  rolled  is  odd”  and  “the  number  rolled  is 
less  than  three.” 
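
The complements in this example can also be checked mechanically by treating each event as a set. The following is a minimal Python sketch; the variable names S, E, T, E_c, and T_c are chosen here only to mirror the example.

```python
# Sample space for one roll of a die, and the two events of the example.
S = {1, 2, 3, 4, 5, 6}
E = {n for n in S if n % 2 == 0}   # "the number rolled is even"
T = {n for n in S if n > 2}        # "the number rolled is greater than two"

# The complement of an event is everything in S that is not in the event.
E_c = S - E
T_c = S - T

print(sorted(E_c))  # [1, 3, 5]
print(sorted(T_c))  # [1, 2]
```

Set difference against S plays the role of "negating the description in words."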


If there is a 60% chance of rain tomorrow, what is the probability of fair weather?
The obvious answer, 40%, is an instance of the following general rule.


Probability  Rule  for  Complements 

P(Ac) = 1 − P(A)


This  formula  is  particularly  useful  when  finding  the  probability  of  an  event  directly 
is  difficult. 
EXAMPLE  11 


Find  the  probability  that  at  least  one  heads  will  appear  in  five  tosses  of  a  fair 
coin. 

Solution: 

Identify  outcomes  by  lists  of  five  hs  and  ts,  such  as  tthtt  and  hhttt. 

Although it is tedious to list them all, it is not difficult to count them. Think
of using a tree diagram to do so. There are two choices for the first toss. For
each of these there are two choices for the second toss, hence 2 × 2 = 4
outcomes for two tosses. For each of these four outcomes, there are two
possibilities for the third toss, hence 4 × 2 = 8 outcomes for three tosses.
Similarly, there are 8 × 2 = 16 outcomes for four tosses and finally
16 × 2 = 32 outcomes for five tosses.

Let O denote the event “at least one heads.” There are many ways to obtain
at least one heads, but only one way to fail to do so: all tails. Thus although it
is difficult to list all the outcomes that form O, it is easy to write
Oc = {ttttt}. Since there are 32 equally likely outcomes, each has
probability 1/32, so P(Oc) = 1/32, hence

P(O) = 1 − 1/32 ≈ 0.97 or about a 97% chance.
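
Because the sample space has only 32 outcomes, a computer can list them all and confirm that the direct count and the Probability Rule for Complements agree. The sketch below is one possible way to do this in Python; the variable names are illustrative and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All sequences of five tosses, e.g. ('t', 't', 'h', 't', 't').
outcomes = list(product("ht", repeat=5))
total = len(outcomes)  # 32

# Direct count: outcomes containing at least one 'h'.
at_least_one_heads = [o for o in outcomes if "h" in o]
p_direct = Fraction(len(at_least_one_heads), total)

# Complement rule: the only outcome with no heads is ttttt.
no_heads = [o for o in outcomes if "h" not in o]
p_by_complement = 1 - Fraction(len(no_heads), total)

print(total, p_direct, p_by_complement)  # 32 31/32 31/32
```

The value 31/32 is about 0.97, in agreement with the solution.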


Intersection  of  Events 


Definition 

The intersection of events⁶ A and B, denoted A ∩ B, is the collection of all outcomes
that are elements of both of the sets A and B. It corresponds to combining descriptions
of the two events using the word “and.”


To say that the event A ∩ B occurred means that on a particular trial of the
experiment both A and B occurred. A visual representation of the intersection of
events A and B in a sample space S is given in Figure 3.4 "The Intersection of Events
A and B". The intersection corresponds to the shaded lens-shaped region that lies within
both ovals.

6.  Both  events  occur. 
Figure  3.4  The  Intersection 
of  Events  A  and  B 


EXAMPLE  12 


In the experiment of rolling a single die, find the intersection E ∩ T of the
events E: “the number rolled is even” and T: “the number rolled is greater
than two.”

Solution: 

The sample space is S = {1, 2, 3, 4, 5, 6}. Since the outcomes that are
common to E = {2, 4, 6} and T = {3, 4, 5, 6} are 4 and 6,
E ∩ T = {4, 6}.

In words the intersection is described by “the number rolled is even and is
greater than two.” The only numbers between one and six that are both
even and greater than two are four and six, corresponding to E ∩ T given
above.
EXAMPLE  13 


A  single  die  is  rolled. 

a.  Suppose  the  die  is  fair.  Find  the  probability  that  the  number  rolled  is 
both  even  and  greater  than  two. 

b. Suppose the die has been “loaded” so that P(1) = 1/12,
P(6) = 3/12, and the remaining four outcomes are equally likely
with one another. Now find the probability that the number rolled is
both even and greater than two.

Solution: 

In both cases the sample space is S = {1, 2, 3, 4, 5, 6} and the event in
question is the intersection E ∩ T = {4, 6} of the previous example.

a. Since the die is fair, all outcomes are equally likely, so by counting we
have P(E ∩ T) = 2/6.

b. The information on the probabilities of the six outcomes that we
have so far is

Outcome       1      2   3   4   5   6
Probability   1/12   p   p   p   p   3/12

Since P(1) + P(6) = 4/12 = 1/3 and the probabilities of all six
outcomes add up to 1,

P(2) + P(3) + P(4) + P(5) = 1 − 1/3 = 2/3

Thus 4p = 2/3, so p = 1/6. In particular P(4) = 1/6.
Therefore

P(E ∩ T) = P(4) + P(6) = 1/6 + 3/12 = 5/12
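
The reasoning in part (b) can be retraced step by step in code. The following Python sketch is only an illustration under the assumptions stated in the example; the dictionary names known and prob are hypothetical, and exact fractions are used so that the arithmetic matches the hand computation.

```python
from fractions import Fraction

# Known probabilities for the loaded die of part (b).
known = {1: Fraction(1, 12), 6: Fraction(3, 12)}

# The remaining four outcomes share the leftover probability equally.
others = [2, 3, 4, 5]
p = (1 - sum(known.values())) / len(others)

# Full probability assignment for all six outcomes.
prob = dict(known)
prob.update({n: p for n in others})

E = {2, 4, 6}        # the number rolled is even
T = {3, 4, 5, 6}     # the number rolled is greater than two

# The probability of an event is the sum of the probabilities of its outcomes.
p_E_and_T = sum(prob[n] for n in E & T)

print(p, p_E_and_T)  # 1/6 5/12
```

Note that counting outcomes (2 out of 6) would give the wrong answer here, because the outcomes of the loaded die are not equally likely.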
Definition 

Events  A  and  B  are  mutually  exclusive⁷  if  they  have  no  elements  in  common. 


For  A  and  B  to  have  no  outcomes  in  common  means  precisely  that  it  is  impossible 
for  both  A  and  B  to  occur  on  a  single  trial  of  the  random  experiment.  This  gives  the 
following  rule. 


Probability  Rule  for  Mutually  Exclusive  Events 

Events  A  and  B  are  mutually  exclusive  if  and  only  if 

P(A ∩ B) = 0


Any  event  A  and  its  complement  Ac  are  mutually  exclusive,  but  A  and  B  can  be 
mutually  exclusive  without  being  complements. 


EXAMPLE  14 


In  the  experiment  of  rolling  a  single  die,  find  three  choices  for  an  event  A  so 
that  the  events  A  and  E:  “the  number  rolled  is  even”  are  mutually  exclusive. 

Solution: 

Since E = {2, 4, 6} and we want A to have no elements in common with
E, any event that does not contain any even number will do. Three choices
are {1, 3, 5} (the complement Ec, the odds), {1, 3}, and {5}.
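
Whether two events are mutually exclusive can be tested by checking that their sets of outcomes share no element. The short Python sketch below does this for the three choices above; the fourth candidate, {4, 5}, is not from the example and is added here only to show a case that fails.

```python
E = {2, 4, 6}  # "the number rolled is even"

# Three choices from the example, plus one extra candidate (an assumption
# added for contrast) that shares the outcome 4 with E.
candidates = [{1, 3, 5}, {1, 3}, {5}, {4, 5}]

# Two events are mutually exclusive when their intersection is empty.
results = [A.isdisjoint(E) for A in candidates]

for A, ok in zip(candidates, results):
    print(sorted(A), "mutually exclusive with E:", ok)
```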


7.  Events  that  cannot  both  occur 
at  once. 
Union  of  Events 


Definition 

The union of events⁸ A and B, denoted A ∪ B, is the collection of all outcomes that
are elements of one or the other of the sets A and B, or of both of them. It corresponds to
combining descriptions of the two events using the word “or.”


To say that the event A ∪ B occurred means that on a particular trial of the
experiment either A or B occurred (or both did). A visual representation of the
union of events A and B in a sample space S is given in Figure 3.5 "The Union of
Events A and B". The union corresponds to the shaded region.


Figure  3.5  The  Union  of 
Events  A  and  B 


8.  One  or  the  other  event  occurs. 
EXAMPLE  15 


In  the  experiment  of  rolling  a  single  die,  find  the  union  of  the  events  E :  “the 
number  rolled  is  even”  and  T :  “the  number  rolled  is  greater  than  two.” 

Solution: 

Since the outcomes that are in either E = {2, 4, 6} or
T = {3, 4, 5, 6} (or both) are 2, 3, 4, 5, and 6,
E ∪ T = {2, 3, 4, 5, 6}. Note that an outcome such as 4 that is in both
sets is still listed only once (although strictly speaking it is not incorrect to
list it twice).

In words the union is described by “the number rolled is even or is greater
than two.” Every number between one and six except the number one is
either even or is greater than two, corresponding to E ∪ T given above.
EXAMPLE  16 


A two-child family is selected at random. Let B denote the event that at least
one child is a boy, let D denote the event that the genders of the two
children differ, and let M denote the event that the genders of the two
children match. Find B ∪ D and B ∪ M.

Solution: 

A  sample  space  for  this  experiment  is  S  =  {bb,bg,  gb,  gg}  ,  where  the 
first  letter  denotes  the  gender  of  the  firstborn  child  and  the  second  letter 
denotes  the  gender  of  the  second  child.  The  events  B,  D,  and  M  are 

B = {bb, bg, gb}    D = {bg, gb}    M = {bb, gg}

Each outcome in D is already in B, so the set of outcomes that are in at least
one of the sets B and D is just the set B itself:

B ∪ D = {bb, bg, gb} = B.

Every outcome in the whole sample space S is in at least one of
the sets B and M, so B ∪ M = {bb, bg, gb, gg} = S.
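
The two unions can be verified by building the events from their verbal descriptions. This is a minimal Python sketch; the two-letter strings are the same outcome labels used in the solution, and the set comprehensions are just one way to encode "at least one boy," "genders differ," and "genders match."

```python
# Sample space for a two-child family: first letter is the firstborn.
S = {"bb", "bg", "gb", "gg"}

B = {o for o in S if "b" in o}       # at least one child is a boy
D = {o for o in S if o[0] != o[1]}   # the genders differ
M = {o for o in S if o[0] == o[1]}   # the genders match

# The union collects every outcome that is in at least one of the two sets.
print(sorted(B | D))  # ['bb', 'bg', 'gb']
print(sorted(B | M))  # ['bb', 'bg', 'gb', 'gg']
```

The first union equals B because D is contained in B; the second equals S.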


The following Additive Rule of Probability is a useful formula for calculating the
probability of A ∪ B.


Additive Rule of Probability

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)


The  next  example,  in  which  we  compute  the  probability  of  a  union  both  by  counting 
and  by  using  the  formula,  shows  why  the  last  term  in  the  formula  is  needed. 
EXAMPLE  17 


Two  fair  dice  are  thrown.  Find  the  probabilities  of  the  following  events: 

a.  both  dice  show  a  four 

b.  at  least  one  die  shows  a  four 

Solution: 

As  was  the  case  with  tossing  two  identical  coins,  actual  experience  dictates 
that  for  the  sample  space  to  have  equally  likely  outcomes  we  should  list 
outcomes  as  if  we  could  distinguish  the  two  dice.  We  could  imagine  that  one 
of  them  is  red  and  the  other  is  green.  Then  any  outcome  can  be  labeled  as  a 
pair  of  numbers  as  in  the  following  display,  where  the  first  number  in  the 
pair  is  the  number  of  dots  on  the  top  face  of  the  green  die  and  the  second 
number  in  the  pair  is  the  number  of  dots  on  the  top  face  of  the  red  die. 


11   12   13   14   15   16
21   22   23   24   25   26
31   32   33   34   35   36
41   42   43   44   45   46
51   52   53   54   55   56
61   62   63   64   65   66

a. There are 36 equally likely outcomes, of which exactly one corresponds
to two fours, so the probability of a pair of fours is 1/36.

b. From the table we can see that there are 11 pairs that correspond
to the event in question: the six pairs in the fourth row (the
green die shows a four) plus the additional five pairs other than
the pair 44, already counted, in the fourth column (the red die is
four), so the answer is 11/36. To see how the formula gives the
same number, let Ag denote the event that the green die is a four
and let Ar denote the event that the red die is a four. Then
clearly by counting we get P(Ag) = 6/36 and
P(Ar) = 6/36. Since Ag ∩ Ar = {44},
P(Ag ∩ Ar) = 1/36; this is the computation in part (a),
of course. Thus by the Additive Rule of Probability,

P(Ag ∪ Ar) = P(Ag) + P(Ar) − P(Ag ∩ Ar) = 6/36 + 6/36 − 1/36 = 11/36
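
Both routes to the answer in part (b), counting and the Additive Rule of Probability, can be reproduced by enumerating the 36 pairs. The Python sketch below assumes, as in the example, that the first number in each pair is the green die and the second is the red die; the helper P is a hypothetical name for "number of outcomes in the event divided by 36."

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes as (green, red) pairs.
outcomes = list(product(range(1, 7), repeat=2))

A_g = {o for o in outcomes if o[0] == 4}   # the green die shows a four
A_r = {o for o in outcomes if o[1] == 4}   # the red die shows a four

def P(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

by_counting = P(A_g | A_r)
by_formula = P(A_g) + P(A_r) - P(A_g & A_r)

print(by_counting, by_formula)  # 11/36 11/36
```

Without the subtracted term the formula would give 12/36, because the pair 44 would be counted in both P(Ag) and P(Ar).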


EXAMPLE  18 


A  tutoring  service  specializes  in  preparing  adults  for  high  school 
equivalence  tests.  Among  all  the  students  seeking  help  from  the  service,  63% 
need  help  in  mathematics,  34%  need  help  in  English,  and  27%  need  help  in 
both  mathematics  and  English.  What  is  the  percentage  of  students  who  need 
help  in  either  mathematics  or  English? 

Solution: 

Imagine selecting a student at random, that is, in such a way that every
student has the same chance of being selected. Let M denote the event “the
student needs help in mathematics” and let E denote the event “the student
needs help in English.” The information given is that P(M) = 0.63,
P(E) = 0.34, and P(M ∩ E) = 0.27. The Additive Rule of
Probability gives

P(M ∪ E) = P(M) + P(E) − P(M ∩ E) = 0.63 + 0.34 − 0.27 = 0.70


Note how the naive reasoning that if 63% need help in mathematics and 34% need
help in English then 63 plus 34, or 97%, need help in one or the other gives a number
that is too large. The percentage that need help in both subjects must be subtracted
off, else the people needing help in both are counted twice, once for needing help in
mathematics and once again for needing help in English. The simple sum of the
probabilities would work if the events in question were mutually exclusive, for then
P(A ∩ B) is zero and makes no difference.
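
The contrast between the naive sum and the Additive Rule of Probability can be made concrete with the tutoring figures. In this Python sketch the function name prob_union is hypothetical, and the percentages are entered as exact fractions so that no rounding error appears.

```python
from fractions import Fraction

def prob_union(p_a, p_b, p_both):
    """Additive Rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_both

p_math = Fraction(63, 100)      # need help in mathematics
p_english = Fraction(34, 100)   # need help in English
p_both = Fraction(27, 100)      # need help in both

naive = p_math + p_english      # counts the overlap twice
correct = prob_union(p_math, p_english, p_both)

print(float(naive), float(correct))  # 0.97 0.7
```

When the two events are mutually exclusive, p_both is zero and the two results coincide.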
EXAMPLE  19 


Volunteers for a disaster relief effort were classified according to both
specialty (C: construction, E: education, M: medicine) and language ability (S:
speaks a single language fluently, T: speaks two or more languages fluently).
The results are shown in the following two-way classification table:


               Language Ability
Specialty      S        T
C              12       1
E              4        3
M              6        2

The  first  row  of  numbers  means  that  12  volunteers  whose  specialty  is 
construction  speak  a  single  language  fluently,  and  1  volunteer  whose 
specialty  is  construction  speaks  at  least  two  languages  fluently.  Similarly  for 
the  other  two  rows. 

A  volunteer  is  selected  at  random,  meaning  that  each  one  has  an  equal 
chance  of  being  chosen.  Find  the  probability  that: 

a.  his  specialty  is  medicine  and  he  speaks  two  or  more  languages; 

b.  either  his  specialty  is  medicine  or  he  speaks  two  or  more  languages; 

c.  his  specialty  is  something  other  than  medicine. 

Solution: 

When  information  is  presented  in  a  two-way  classification  table  it  is 
typically  convenient  to  adjoin  to  the  table  the  row  and  column  totals,  to 
produce  a  new  table  like  this: 


               Language Ability
Specialty      S        T        Total
C              12       1        13
E              4        3        7
M              6        2        8
Total          22       6        28

a. The probability sought is P(M ∩ T). The table shows that there are 2
such people, out of 28 in all, hence P(M ∩ T) = 2/28 ≈ 0.07 or
about a 7% chance.

b. The probability sought is P(M ∪ T). The third row total and
the grand total in the sample give P(M) = 8/28. The
second column total and the grand total give P(T) = 6/28.
Thus using the result from part (a),

P(M ∪ T) = P(M) + P(T) − P(M ∩ T) = 8/28 + 6/28 − 2/28 = 12/28 ≈ 0.43

or about a 43% chance.

c. This probability can be computed in two ways. Since the event of
interest can be viewed as the event C ∪ E and the events C and E
are mutually exclusive, the answer is, using the first two row
totals,

P(C ∪ E) = P(C) + P(E) − P(C ∩ E) = 13/28 + 7/28 − 0/28 = 20/28 ≈ 0.71

On the other hand, the event of interest can be thought of as the
complement Mc of M, hence using the value of P(M) computed
in part (b),

P(Mc) = 1 − P(M) = 1 − 8/28 = 20/28 ≈ 0.71

as before.
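
All three parts of this example amount to adding up cells of the two-way table and dividing by the grand total. The Python sketch below shows one way to organize that computation; the nested dictionary counts and the helper P are illustrative names, not part of the example.

```python
from fractions import Fraction

# Counts from the two-way table: counts[specialty][language ability].
counts = {
    "C": {"S": 12, "T": 1},
    "E": {"S": 4, "T": 3},
    "M": {"S": 6, "T": 2},
}
total = sum(sum(row.values()) for row in counts.values())  # 28 volunteers

def P(predicate):
    """Probability that a randomly chosen volunteer satisfies the predicate."""
    favorable = sum(
        n
        for spec, row in counts.items()
        for lang, n in row.items()
        if predicate(spec, lang)
    )
    return Fraction(favorable, total)

p_a = P(lambda spec, lang: spec == "M" and lang == "T")  # part (a)
p_b = P(lambda spec, lang: spec == "M" or lang == "T")   # part (b)
p_c = P(lambda spec, lang: spec != "M")                  # part (c)

print(p_a, p_b, p_c)  # 1/14 3/7 5/7
```

The reduced fractions 1/14, 3/7, and 5/7 equal 2/28, 12/28, and 20/28, matching the three answers above.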
KEY  TAKEAWAY 


•  The  probability  of  an  event  that  is  a  complement  or  union  of  events  of 
known  probability  can  be  computed  using  formulas. 
1. For the sample space S = {a, b, c, d, e} identify the complement of each
event given.

a. A = {a, d, e}

b.  B  =  {b,  c,d,e} 

c.  S 

2. For the sample space S = {r, s, t, u, v} identify the complement of each
event given.

a. R = {t, u}

b. T = {r}

c. ∅ (the “empty” set that has no elements)

3.  The  sample  space  for  three  tosses  of  a  coin  is 

S = {hhh, hht, hth, htt, thh, tht, tth, ttt}

Define  events 

H  :  at  least  one  head  is  observed 
M  :  more  heads  than  tails  are  observed 

a.  List  the  outcomes  that  comprise  H  and  M. 

b. List the outcomes that comprise H ∩ M, H ∪ M, and Hc.

c. Assuming all outcomes are equally likely, find P(H ∩ M), P(H ∪ M),
and P(Hc).

d.  Determine  whether  or  not  Hc  and  M  are  mutually  exclusive.  Explain  why  or 
why  not. 

4.  For  the  experiment  of  rolling  a  single  six-sided  die  once,  define  events 

T  :  the  number  rolled  is  three 
G  :  the  number  rolled  is  four  or  greater 

a.  List  the  outcomes  that  comprise  T  and  G. 

b. List the outcomes that comprise T ∩ G, T ∪ G, Tc, and (T ∪ G)c.

c. Assuming all outcomes are equally likely, find P(T ∩ G), P(T ∪ G),
and P(Tc).

d.  Determine  whether  or  not  T  and  G  are  mutually  exclusive.  Explain  why  or 
why  not. 
5.  A  special  deck  of  16  cards  has  4  that  are  blue,  4  yellow,  4  green,  and  4  red.  The 
four  cards  of  each  color  are  numbered  from  one  to  four.  A  single  card  is  drawn 
at  random.  Define  events 

B  :  the  card  is  blue 
R  :  the  card  is  red 

N  :  the  number  on  the  card  is  at  most  two 

a.  List  the  outcomes  that  comprise  B,  R,  and  N. 

b. List the outcomes that comprise B ∩ R, B ∪ R, B ∩ N, R ∪ N, Bc, and
(B ∪ R)c.

c.  Assuming  all  outcomes  are  equally  likely,  find  the  probabilities  of  the 
events  in  the  previous  part. 

d.  Determine  whether  or  not  B  and  N  are  mutually  exclusive.  Explain  why  or 
why  not. 

6.  In  the  context  of  the  previous  problem,  define  events 

Y  :  the  card  is  yellow 

I  :  the  number  on  the  card  is  not  a  one 

J  :  the  number  on  the  card  is  a  two  or  a  four 

a.  List  the  outcomes  that  comprise  Y,  I,  and  J. 

b. List the outcomes that comprise Y ∩ I, Y ∪ J, I ∩ J, Ic, and (Y ∪ J)c.

c.  Assuming  all  outcomes  are  equally  likely,  find  the  probabilities  of  the 
events  in  the  previous  part. 

d.  Determine  whether  or  not  Ic  and  J  are  mutually  exclusive.  Explain  why  or 
why  not. 

7.  The  Venn  diagram  provided  shows  a  sample  space  and  two  events  A  and  B. 

Suppose P(a) = 0.13, P(b) = 0.09, P(c) = 0.27, P(d) = 0.20,
and P(e) = 0.31. Confirm that the probabilities of the outcomes add up to
1, then compute the following probabilities.

[Figure: Venn diagram of events A and B with the outcomes a, b, c, d, and e]

a. P(A).

b. P(B).

c. P(Ac) two ways: (i) by finding the outcomes in Ac and adding their
probabilities, and (ii) using the Probability Rule for Complements.

d. P(A ∩ B).

e. P(A ∪ B) two ways: (i) by finding the outcomes in A ∪ B and adding
their probabilities, and (ii) using the Additive Rule of Probability.

8.  The  Venn  diagram  provided  shows  a  sample  space  and  two  events  A  and  B. 

Suppose  P  (a)  =  0.32,  P  ( b )  =  0.17,  P(c)  =  0.28,  and 
P  (d)  =  0.23.  Confirm  that  the  probabilities  of  the  outcomes  add  up  to  1, 
then  compute  the  following  probabilities. 


a. P(A).

b. P(B).

c. P(Ac) two ways: (i) by finding the outcomes in Ac and adding their
probabilities, and (ii) using the Probability Rule for Complements.

d. P(A ∩ B).

e. P(A ∪ B) two ways: (i) by finding the outcomes in A ∪ B and adding
their probabilities, and (ii) using the Additive Rule of Probability.

9.  Confirm  that  the  probabilities  in  the  two-way  contingency  table  add  up  to  1, 
then  use  it  to  find  the  probabilities  of  the  events  indicated. 


      U      V      W
A     0.15   0.00   0.23
B     0.22   0.30   0.10

a. P(A), P(B), P(A ∩ B).

b. P(U), P(W), P(U ∩ W).

c. P(U ∪ W).

d. P(Vc).

e.  Determine  whether  or  not  the  events  A  and  U  are  mutually  exclusive;  the 
events  A  and  V. 

10.  Confirm  that  the  probabilities  in  the  two-way  contingency  table  add  up  to  1, 
then  use  it  to  find  the  probabilities  of  the  events  indicated. 
      R      S      T
M     0.09   0.25   0.19
N     0.31   0.16   0.00

a. P(R), P(S), P(R ∩ S).

b. P(M), P(N), P(M ∩ N).

c. P(R ∪ S).

d. P(Rc).

e.  Determine  whether  or  not  the  events  N  and  S  are  mutually  exclusive;  the 
events  N  and  T. 


APPLICATIONS 


11.  Make  a  statement  in  ordinary  English  that  describes  the  complement  of  each 
event  (do  not  simply  insert  the  word  “not”). 

a.  In  the  roll  of  a  die:  “five  or  more.” 

b.  In  a  roll  of  a  die:  “an  even  number.” 

c.  In  two  tosses  of  a  coin:  “at  least  one  heads.” 

d.  In  the  random  selection  of  a  college  student:  “Not  a  freshman.” 

12.  Make  a  statement  in  ordinary  English  that  describes  the  complement  of  each 
event  (do  not  simply  insert  the  word  “not”). 

a.  In  the  roll  of  a  die:  “two  or  less.” 

b.  In  the  roll  of  a  die:  “one,  three,  or  four.” 

c.  In  two  tosses  of  a  coin:  “at  most  one  heads.” 

d.  In  the  random  selection  of  a  college  student:  “Neither  a  freshman  nor  a 
senior.” 

13.  The  sample  space  that  describes  all  three-child  families  according  to  the 
genders  of  the  children  with  respect  to  birth  order  is 

S = {bbb, bbg, bgb, bgg, gbb, gbg, ggb, ggg}.

For  each  of  the  following  events  in  the  experiment  of  selecting  a  three-child 
family  at  random,  state  the  complement  of  the  event  in  the  simplest  possible 
terms,  then  find  the  outcomes  that  comprise  the  event  and  its  complement. 

a.  At  least  one  child  is  a  girl. 

b.  At  most  one  child  is  a  girl. 

c.  All  of  the  children  are  girls. 

d.  Exactly  two  of  the  children  are  girls. 

e.  The  first  born  is  a  girl. 
14. The sample space that describes the two-way classification of citizens
according to gender and opinion on a political issue is

S = {mf, ma, mn, ff, fa, fn},

where  the  first  letter  denotes  gender  (m:  male,  f:  female)  and  the  second 
opinion  (f:  for,  a:  against,  n:  neutral).  For  each  of  the  following  events  in  the 
experiment  of  selecting  a  citizen  at  random,  state  the  complement  of  the  event 
in  the  simplest  possible  terms,  then  find  the  outcomes  that  comprise  the  event 
and  its  complement. 

a.  The  person  is  male. 

b.  The  person  is  not  in  favor. 

c.  The  person  is  either  male  or  in  favor. 

d.  The  person  is  female  and  neutral. 

15. A tourist who speaks English and German but no other language visits a region
of Slovenia. If 35% of the residents speak English, 15% speak German, and 3%
speak both English and German, what is the probability that the tourist will be
able to talk with a randomly encountered resident of the region?

16.  In  a  certain  country  43%  of  all  automobiles  have  airbags,  27%  have  anti-lock 
brakes,  and  13%  have  both.  What  is  the  probability  that  a  randomly  selected 
vehicle  will  have  both  airbags  and  anti-lock  brakes? 

17. A manufacturer examines its records over the last year on a component part
received from outside suppliers. The breakdown on source (supplier A,
supplier B) and quality (H: high, U: usable, D: defective) is shown in the
two-way contingency table.

            H         U         D
A       0.6937    0.0049    0.0014
B       0.2982    0.0009    0.0009

The record of a part is selected at random. Find the probability of each of the
following events.

a.  The  part  was  defective. 

b.  The  part  was  either  of  high  quality  or  was  at  least  usable,  in  two  ways:  (i) 
by  adding  numbers  in  the  table,  and  (ii)  using  the  answer  to  (a)  and  the 
Probability  Rule  for  Complements. 

c.  The  part  was  defective  and  came  from  supplier  B. 

d. The part was defective or came from supplier B, in two ways: (i) by finding
the cells in the table that correspond to this event and adding their
probabilities, and (ii) using the Additive Rule of Probability.
18. Individuals with a particular medical condition were classified according to the
presence (T) or absence (N) of a potential toxin in their blood and the onset of
the condition (E: early, M: midrange, L: late). The breakdown according to this
classification is shown in the two-way contingency table.

           E        M        L
T      0.012    0.124    0.013
N      0.170    0.638    0.043

One  of  these  individuals  is  selected  at  random.  Find  the  probability  of  each  of 
the  following  events. 

a.  The  person  experienced  early  onset  of  the  condition. 

b.  The  onset  of  the  condition  was  either  midrange  or  late,  in  two  ways:  (i)  by 
adding  numbers  in  the  table,  and  (ii)  using  the  answer  to  (a)  and  the 
Probability  Rule  for  Complements. 

c.  The  toxin  is  present  in  the  person’s  blood. 

d.  The  person  experienced  early  onset  of  the  condition  and  the  toxin  is 
present  in  the  person’s  blood. 

e.  The  person  experienced  early  onset  of  the  condition  or  the  toxin  is  present 
in  the  person’s  blood,  in  two  ways:  (i)  by  finding  the  cells  in  the  table  that 
correspond  to  this  event  and  adding  their  probabilities,  and  (ii)  using  the 
Additive  Rule  of  Probability. 

19. The breakdown of the students enrolled in a university course by class (F:
freshman, So: sophomore, J: junior, Se: senior) and academic major (S:
science, mathematics, or engineering, L: liberal arts, O: other) is shown in the
two-way classification table.

                   Class
Major      F      So       J      Se
S         92      42      20      13
L        368     167      80      53
O        460     209     100      67

A  student  enrolled  in  the  course  is  selected  at  random.  Adjoin  the  row  and 
column  totals  to  the  table  and  use  the  expanded  table  to  find  the  probability  of 
each  of  the  following  events. 

a.  The  student  is  a  freshman. 

b.  The  student  is  a  liberal  arts  major. 

c.  The  student  is  a  freshman  liberal  arts  major. 
d. The student is either a freshman or a liberal arts major.

e.  The  student  is  not  a  liberal  arts  major. 

20.  The  table  relates  the  response  to  a  fund-raising  appeal  by  a  college  to  its 
alumni  to  the  number  of  years  since  graduation. 


An  alumnus  is  selected  at  random.  Adjoin  the  row  and  column  totals  to  the 
table  and  use  the  expanded  table  to  find  the  probability  of  each  of  the 
following  events. 

a.  The  alumnus  responded. 

b.  The  alumnus  did  not  respond. 

c.  The  alumnus  graduated  at  least  21  years  ago. 

d.  The  alumnus  graduated  at  least  21  years  ago  and  responded. 


ADDITIONAL  EXERCISES 


21.  The  sample  space  for  tossing  three  coins  is 

S = {hhh, hht, hth, htt, thh, tht, tth, ttt}.

a. List the outcomes that correspond to the statement “All the coins are
heads.”

b. List the outcomes that correspond to the statement “Not all the coins are
heads.”

c. List the outcomes that correspond to the statement “All the coins are not
heads.”
ANSWERS


1. a. {b, c}

b. {a}

c. ∅

3. a. H = {hhh, hht, hth, htt, thh, tht, tth},
M = {hhh, hht, hth, thh}

b. H ∩ M = {hhh, hht, hth, thh}, H ∪ M = H, H^c = {ttt}

c. P(H ∩ M) = 4/8, P(H ∪ M) = 7/8, P(H^c) = 1/8

d. Mutually exclusive because they have no elements in common.

5. a. B = {b1, b2, b3, b4}, R = {r1, r2, r3, r4},
N = {b1, b2, y1, y2, g1, g2, r1, r2}

b. B ∩ R = ∅, B ∪ R = {b1, b2, b3, b4, r1, r2, r3, r4},
B ∩ N = {b1, b2},
R ∪ N = {b1, b2, y1, y2, g1, g2, r1, r2, r3, r4},
B^c = {y1, y2, y3, y4, g1, g2, g3, g4, r1, r2, r3, r4},
(B ∪ R)^c = {y1, y2, y3, y4, g1, g2, g3, g4}

c. P(B ∩ R) = 0, P(B ∪ R) = 8/16, P(B ∩ N) = 2/16,
P(R ∪ N) = 10/16, P(B^c) = 12/16,
P((B ∪ R)^c) = 8/16

d. Not mutually exclusive because they have an element in common.

7.  a.  0.36 

b.  0.78 

c.  0.64 

d.  0.27 

e.  0.87 

9. a. P(A) = 0.38, P(B) = 0.62, P(A ∩ B) = 0

b. P(U) = 0.37, P(W) = 0.33, P(U ∩ W) = 0

c. 0.7

d. 0.7

e. A and U are not mutually exclusive because P(A ∩ U) is the nonzero
number 0.15. A and V are mutually exclusive because P(A ∩ V) = 0.

11.  a.  “four  or  less” 

b.  “an  odd  number” 

c.  “no  heads”  or  “all  tails” 

d.  “a  freshman” 
13. a. “All the children are boys.”

Event:  {  bbg,  bgb,  bgg,  gbb,  gbg,  ggb,  ggg  }  , 

Complement:  {bbb} 

b.  “At  least  two  of  the  children  are  girls”  or  “There  are  two  or  three  girls.” 
Event:  {  bbb,  bbg,  bgb,  gbb  }  , 

Complement:  {bgg,  gbg,  ggb,  ggg} 

c.  “At  least  one  child  is  a  boy.” 

Event:  {ggg}  , 

Complement:  {bbb,  bbg,  bgb,  bgg,  gbb,  gbg,  ggb } 

d.  “There  are  either  no  girls,  exactly  one  girl,  or  three  girls.” 

Event:  {bgg,  gbg,  ggb}  , 

Complement:  {bbb,  bbg,  bgb,  gbb,  ggg} 

e.  “The  first  born  is  a  boy.” 

Event:  {gbb,  gbg,  ggb,  ggg}  , 

Complement:  {bbb,  bbg,  bgb,  bgg } 


15.  0.47 


17.  a.  0.0023 

b.  0.9977 

c.  0.0009 

d.  0.3014 

19.  a.  920/1671 

b.  668/1671 

c.  368/1671 

d.  1220/1671 

e.  1003/1671 

21.  a.  {hhh} 

b.  { hht,  hth,  htt,  thh,  tht,  tth,  ttt} 

c.  {  ttt  } 
3.3 Conditional Probability and Independent Events


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  a  conditional  probability  and  how  to  compute  it. 

2.  To  learn  the  concept  of  independence  of  events,  and  how  to  apply  it. 


Conditional  Probability 

Suppose a fair die has been rolled and you are asked to give the probability that it
was a five. There are six equally likely outcomes, so your answer is 1/6. But suppose
that before you give your answer you are given the extra information that the
number rolled was odd. Since there are only three odd numbers that are possible,
one of which is five, you would certainly revise your estimate of the likelihood that
a five was rolled from 1/6 to 1/3. In general, the revised probability that an event A
has occurred, taking into account the additional information that another event B
has definitely occurred on this trial of the experiment, is called the conditional
probability of A given B and is denoted by P(A | B). The reasoning employed in this
example can be generalized to yield the computational formula in the following
definition.


Definition 

The conditional probability⁹ of A given B, denoted P(A | B), is the probability that
event A has occurred in a trial of a random experiment for which it is known that event
B has definitely occurred. It may be computed by means of the following formula:

Rule for Conditional Probability

P(A | B) = P(A ∩ B) / P(B)
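
As a rough illustration of the rule, the following Python sketch computes a
conditional probability by counting, under the assumption that all outcomes in
the sample space are equally likely. The function names prob and cond_prob, and
the use of Python sets to stand for events, are choices made here for
illustration and are not notation from the text.

```python
from fractions import Fraction

def prob(event, sample_space):
    """P(event) when every outcome in sample_space is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def cond_prob(a, b, sample_space):
    """P(A | B) = P(A ∩ B) / P(B); only defined when P(B) > 0."""
    p_b = prob(b, sample_space)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b, sample_space) / p_b

# The die from the opening paragraph: "a five" given "an odd number".
S = {1, 2, 3, 4, 5, 6}
five = {5}
odd = {1, 3, 5}
print(cond_prob(five, odd, S))  # 1/3
```

Exact fractions are used instead of floating-point numbers so that the result
can be compared directly with the hand computation.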


9.  The  probability  of  the  event  A 
taking  into  account  the  fact 
that  event  B  is  known  to  have 
occurred. 
EXAMPLE 20


A  fair  die  is  rolled. 

a.  Find  the  probability  that  the  number  rolled  is  a  five,  given  that  it  is  odd. 

b.  Find  the  probability  that  the  number  rolled  is  odd,  given  that  it  is  a  five. 

Solution: 


The sample space for this experiment is the set S = {1, 2, 3, 4, 5, 6}
consisting of six equally likely outcomes. Let F denote the event “a five is
rolled” and let O denote the event “an odd number is rolled,” so that


F = {5} and O = {1, 3, 5}


a. This is the introductory example, so we already know that the
answer is 1/3. To use the formula in the definition to confirm
this we must replace A in the formula (the event whose
likelihood we seek to estimate) by F and replace B (the event we
know for certain has occurred) by O:

P(F | O) = P(F ∩ O) / P(O)

Since F ∩ O = {5} ∩ {1, 3, 5} = {5},

P(F ∩ O) = 1/6.

Since O = {1, 3, 5}, P(O) = 3/6.

Thus

P(F | O) = P(F ∩ O) / P(O) = (1/6) / (3/6) = 1/3


b. This is the same problem, but with the roles of F and O reversed.
Since we are given that the number that was rolled is five, which
is odd, the probability in question must be 1. To apply the
formula to this case we must now replace A (the event whose
likelihood we seek to estimate) by O and B (the event we know for
certain has occurred) by F:

P(O | F) = P(O ∩ F) / P(F)

Obviously P(F) = 1/6. In part (a) we found that
P(F ∩ O) = 1/6. Thus

P(O | F) = P(O ∩ F) / P(F) = (1/6) / (1/6) = 1


Just  as  we  did  not  need  the  computational  formula  in  this  example,  we  do  not  need 
it  when  the  information  is  presented  in  a  two-way  classification  table,  as  in  the  next 
example. 
EXAMPLE 21


In  a  sample  of  902  individuals  under  40  who  were  or  had  previously  been 
married,  each  person  was  classified  according  to  gender  and  age  at  first 
marriage.  The  results  are  summarized  in  the  following  two-way 
classification  table,  where  the  meaning  of  the  labels  is: 

•  M:  male 

•  F:  female 

•  E:  a  teenager  when  first  married 

• W: in one’s twenties when first married

• H: in one’s thirties when first married


            E       W       H    Total
M          43     293     114      450
F          82     299      71      452
Total     125     592     185      902

The  numbers  in  the  first  row  mean  that  43  people  in  the  sample  were  men 
who  were  first  married  in  their  teens,  293  were  men  who  were  first  married 
in  their  twenties,  114  men  who  were  first  married  in  their  thirties,  and  a 
total  of  450  people  in  the  sample  were  men.  Similarly  for  the  numbers  in  the 
second  row.  The  numbers  in  the  last  row  mean  that,  irrespective  of  gender, 
125  people  in  the  sample  were  married  in  their  teens,  592  in  their  twenties, 
185  in  their  thirties,  and  that  there  were  902  people  in  the  sample  in  all. 
Suppose that the proportions in the sample accurately reflect those in the
population of all individuals who are under 40 and who are or have previously
been married. Suppose such a person is selected at
random.

a.  Find  the  probability  that  the  individual  selected  was  a  teenager  at  first 
marriage. 

b.  Find  the  probability  that  the  individual  selected  was  a  teenager  at  first 
marriage,  given  that  the  person  is  male. 

Solution: 

It  is  natural  to  let  E  also  denote  the  event  that  the  person  selected  was  a 
teenager  at  first  marriage  and  to  let  M  denote  the  event  that  the  person 
selected  is  male. 
a. According to the table the proportion of individuals in the sample who
were in their teens at their first marriage is 125/902. This is the relative
frequency of such people in the population, hence

P(E) = 125/902 ≈ 0.139 or about 14%.

b. Since it is known that the person selected is male, all the females
may be removed from consideration, so that only the row in the
table corresponding to men in the sample applies:

            E       W       H    Total
M          43     293     114      450

The proportion of males in the sample who were in their teens at their first
marriage is 43/450. This is the relative frequency of such people in the
population of males, hence P(E | M) = 43/450 ≈ 0.096 or about
10%.
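
The shortcut of discarding the female row gives the same number as the formula
in the definition. The sketch below checks this on the counts from the table in
Example 21; the dictionary layout and variable names are illustrative
assumptions, not part of the text.

```python
from fractions import Fraction

# Counts from the two-way table (rows: gender, columns: age at first marriage).
counts = {
    "M": {"E": 43, "W": 293, "H": 114},
    "F": {"E": 82, "W": 299, "H": 71},
}

total = sum(sum(row.values()) for row in counts.values())   # 902 people
males = sum(counts["M"].values())                           # 450 men
teens = sum(row["E"] for row in counts.values())            # 125 married as teens

p_e = Fraction(teens, total)                    # P(E)
p_m = Fraction(males, total)                    # P(M)
p_e_and_m = Fraction(counts["M"]["E"], total)   # P(E ∩ M)

# Route 1: the formula P(E | M) = P(E ∩ M) / P(M).
by_formula = p_e_and_m / p_m
# Route 2: keep only the male row, as in the solution to part (b).
by_row = Fraction(counts["M"]["E"], males)

print(round(float(p_e), 3))                        # 0.139
print(by_formula == by_row, round(float(by_row), 3))  # True 0.096
```

The two routes agree because the common factor 1/902 cancels in the quotient.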


In  the  next  example,  the  computational  formula  in  the  definition  must  be  used. 
EXAMPLE 22


Suppose  that  in  an  adult  population  the  proportion  of  people  who  are  both 
overweight  and  suffer  hypertension  is  0.09;  the  proportion  of  people  who 
are  not  overweight  but  suffer  hypertension  is  0.11;  the  proportion  of  people 
who  are  overweight  but  do  not  suffer  hypertension  is  0.02;  and  the 
proportion  of  people  who  are  neither  overweight  nor  suffer  hypertension  is 
0.78.  An  adult  is  randomly  selected  from  this  population. 

a.  Find  the  probability  that  the  person  selected  suffers  hypertension  given 
that  he  is  overweight. 

b.  Find  the  probability  that  the  selected  person  suffers  hypertension  given 
that  he  is  not  overweight. 

c.  Compare  the  two  probabilities  just  found  to  give  an  answer  to  the 
question  as  to  whether  overweight  people  tend  to  suffer  from 
hypertension. 

Solution: 

Let H denote the event “the person selected suffers hypertension.” Let O
denote the event “the person selected is overweight.” The probability
information given in the problem may be organized into the following
contingency table:

           O      O^c
H       0.09     0.11
H^c     0.02     0.78

a. Using the formula in the definition of conditional probability,

P(H | O) = P(H ∩ O) / P(O) = 0.09 / (0.09 + 0.02) = 0.8182

b. Using the formula in the definition of conditional probability,

P(H | O^c) = P(H ∩ O^c) / P(O^c) = 0.11 / (0.11 + 0.78) = 0.1236

c. P(H | O) = 0.8182 is over six times as large as
P(H | O^c) = 0.1236, which indicates a much higher rate of
hypertension among people who are overweight than among people
who are not overweight. It might be interesting to note that a direct
comparison of P(H ∩ O) = 0.09 and P(H ∩ O^c) = 0.11 does
not answer the same question.
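
The computation in Example 22 can be retraced numerically. The following sketch
is only an illustration: the dictionary keys and the helper names
marginal_weight and hypertension_given are assumptions introduced here.

```python
# Joint probabilities from the contingency table in Example 22.
# Keys are (hypertension status, weight status); a trailing "c" marks a complement.
joint = {
    ("H", "O"): 0.09, ("H", "Oc"): 0.11,
    ("Hc", "O"): 0.02, ("Hc", "Oc"): 0.78,
}

def marginal_weight(status):
    """P(O) or P(O^c): add the cells in the corresponding column."""
    return sum(p for (_, w), p in joint.items() if w == status)

def hypertension_given(status):
    """P(H | status) = P(H ∩ status) / P(status)."""
    return joint[("H", status)] / marginal_weight(status)

p_h_given_o = hypertension_given("O")
p_h_given_oc = hypertension_given("Oc")

print(round(p_h_given_o, 4))                  # 0.8182
print(round(p_h_given_oc, 4))                 # 0.1236
print(round(p_h_given_o / p_h_given_oc, 1))   # 6.6
```

Dividing each joint probability by its column total is what makes the two groups
comparable, which is why the raw cells 0.09 and 0.11 cannot be compared directly.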


Independent  Events 

Although typically we expect the conditional probability P(A | B) to be different
from the probability P(A) of A, it does not have to be different from P(A). When
P(A | B) = P(A), the occurrence of B has no effect on the likelihood of A. Whether
or not the event A has occurred is independent of the event B.


Using algebra it can be shown that the equality P(A | B) = P(A) holds if and only if
the equality P(A ∩ B) = P(A) · P(B) holds, which in turn is true if and only if
P(B | A) = P(B). This is the basis for the following definition.
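
The algebra behind these equivalences is short. A sketch, assuming P(A) > 0 and
P(B) > 0 so that both conditional probabilities are defined (an assumption the
text leaves implicit):

```latex
\begin{align*}
P(A \mid B) = P(A)
  &\iff \frac{P(A \cap B)}{P(B)} = P(A)
   && \text{(definition of } P(A \mid B)\text{)} \\
  &\iff P(A \cap B) = P(A) \cdot P(B)
   && \text{(multiply both sides by } P(B)\text{)} \\
  &\iff \frac{P(A \cap B)}{P(A)} = P(B)
   && \text{(divide both sides by } P(A)\text{)} \\
  &\iff P(B \mid A) = P(B)
   && \text{(definition of } P(B \mid A)\text{)}
\end{align*}
```

The middle line is symmetric in A and B, which is why it is the natural choice
for a definition.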


Definition 

Events A and B are independent¹⁰ if

P(A ∩ B) = P(A) · P(B)

If A and B are not independent then they are dependent.


The  formula  in  the  definition  has  two  practical  but  exactly  opposite  uses: 


10.  Events  whose  probability  of 
occurring  together  is  the 
product  of  their  individual 
probabilities. 


1. In a situation in which we can compute all three probabilities P(A),
P(B), and P(A ∩ B), it is used to check whether or not the events A
and B are independent:

   ° If P(A ∩ B) = P(A) · P(B), then A and B are independent.

   ° If P(A ∩ B) ≠ P(A) · P(B), then A and B are not independent.

2. In a situation in which each of P(A) and P(B) can be computed and it
is known that A and B are independent, we can compute
P(A ∩ B) by multiplying together P(A) and P(B):
P(A ∩ B) = P(A) · P(B).


EXAMPLE  23 


A single fair die is rolled. Let A = {3} and B = {1, 3, 5}. Are A and B
independent?

Solution:

In this example we can compute all three probabilities P(A) = 1/6,
P(B) = 1/2, and P(A ∩ B) = P({3}) = 1/6. Since the product
P(A) · P(B) = (1/6)(1/2) = 1/12 is not the same number
as P(A ∩ B) = 1/6, the events A and B are not independent.
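
The first use of the formula, checking independence, is mechanical enough to
sketch in Python for equally likely outcomes. The helper names prob and
independent are illustrative assumptions. The second pair of events below
(an even number, and a number at most four) does not appear in the text and is
added only to show a case where the product rule does hold.

```python
from fractions import Fraction

def prob(event, sample_space):
    """P(event) when every outcome in sample_space is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def independent(a, b, sample_space):
    """True exactly when P(A ∩ B) = P(A) · P(B)."""
    return prob(a & b, sample_space) == prob(a, sample_space) * prob(b, sample_space)

S = {1, 2, 3, 4, 5, 6}

# Example 23: A = {3}, B = {1, 3, 5}.
print(independent({3}, {1, 3, 5}, S))            # False: 1/6 versus 1/12

# Added illustration: "even" and "at most four".
print(independent({2, 4, 6}, {1, 2, 3, 4}, S))   # True: 1/3 equals (1/2)(2/3)
```

Exact fractions matter here: with floating-point numbers an equality test such
as this one can fail because of rounding.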
EXAMPLE 24


The two-way classification of married or previously married adults under 40
according to gender and age at first marriage in Note 3.48 "Example 21"
produced the table

            E       W       H    Total
M          43     293     114      450
F          82     299      71      452
Total     125     592     185      902

Determine  whether  or  not  the  events  F:  “female”  and  E:  “was  a  teenager  at 
first  marriage”  are  independent. 

Solution: 

The  table  shows  that  in  the  sample  of  902  such  adults,  452  were  female,  125 
were  teenagers  at  their  first  marriage,  and  82  were  females  who  were 
teenagers  at  their  first  marriage,  so  that 


P(F) = 452/902,  P(E) = 125/902,  P(F ∩ E) = 82/902


Since

P(F) · P(E) = (452/902) · (125/902) ≈ 0.069

is not the same as

P(F ∩ E) = 82/902 ≈ 0.091


we  conclude  that  the  two  events  are  not  independent. 
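
The same check can be run directly on the counts quoted in the solution. The
sketch below is an illustration added here, with arbitrary variable names. It
also reports how many females married as teenagers the sample would need to
contain for the two events to be exactly independent, which gives a feel for
how far the table is from independence.

```python
from fractions import Fraction

n = 902      # sample size
n_f = 452    # females
n_e = 125    # teenagers at first marriage
n_fe = 82    # females who were teenagers at first marriage

p_f, p_e, p_fe = Fraction(n_f, n), Fraction(n_e, n), Fraction(n_fe, n)

product = p_f * p_e
print(round(float(product), 3), round(float(p_fe), 3))  # 0.069 0.091
print(p_fe == product)                                  # False: not independent

# Under independence the F-and-E cell would hold n · P(F) · P(E) people.
expected_fe = float(n * product)
print(round(expected_fe, 1))                            # 62.6, versus 82 observed
```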
EXAMPLE 25


Many  diagnostic  tests  for  detecting  diseases  do  not  test  for  the  disease 
directly  but  for  a  chemical  or  biological  product  of  the  disease,  hence  are  not 
perfectly  reliable.  The  sensitivity  of  a  test  is  the  probability  that  the  test  will 
be  positive  when  administered  to  a  person  who  has  the  disease.  The  higher 
the  sensitivity,  the  greater  the  detection  rate  and  the  lower  the  false 
negative  rate. 

Suppose  the  sensitivity  of  a  diagnostic  procedure  to  test  whether  a  person 
has  a  particular  disease  is  92%.  A  person  who  actually  has  the  disease  is 
tested  for  it  using  this  procedure  by  two  independent  laboratories. 

a.  What  is  the  probability  that  both  test  results  will  be  positive? 

b.  What  is  the  probability  that  at  least  one  of  the  two  test  results  will  be 
positive? 

Solution: 


a. Let A1 denote the event “the test by the first laboratory is
positive” and let A2 denote the event “the test by the second
laboratory is positive.” Since A1 and A2 are independent,

P(A1 ∩ A2) = P(A1) · P(A2) = 0.92 × 0.92 = 0.8464

b. Using the Additive Rule for Probability and the probability just
computed,

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2) = 0.92 + 0.92 − 0.8464 = 0.9936
EXAMPLE 26


The  specificity  of  a  diagnostic  test  for  a  disease  is  the  probability  that  the  test 
will  be  negative  when  administered  to  a  person  who  does  not  have  the 
disease.  The  higher  the  specificity,  the  lower  the  false  positive  rate. 

Suppose  the  specificity  of  a  diagnostic  procedure  to  test  whether  a  person 
has  a  particular  disease  is  89%. 

a.  A  person  who  does  not  have  the  disease  is  tested  for  it  using  this 
procedure.  What  is  the  probability  that  the  test  result  will  be  positive? 

b.  A  person  who  does  not  have  the  disease  is  tested  for  it  by  two 
independent  laboratories  using  this  procedure.  What  is  the  probability 
that  both  test  results  will  be  positive? 

Solution: 


a. Let B denote the event “the test result is positive.” The
complement of B is that the test result is negative, and its
probability is the specificity of the test, 0.89. Thus

P(B) = 1 − P(B^c) = 1 − 0.89 = 0.11.

b. Let B1 denote the event “the test by the first laboratory is
positive” and let B2 denote the event “the test by the second
laboratory is positive.” Since B1 and B2 are independent, by part
(a) of the example

P(B1 ∩ B2) = P(B1) · P(B2) = 0.11 × 0.11 = 0.0121.


The concept of independence applies to any number of events. For example, three
events A, B, and C are independent if P(A ∩ B ∩ C) = P(A) · P(B) · P(C). Note
carefully that, as is the case with just two events, this is not a formula that is always
valid, but holds precisely when the events in question are independent.
EXAMPLE 27


The  reliability  of  a  system  can  be  enhanced  by  redundancy,  which  means 
building  two  or  more  independent  devices  to  do  the  same  job,  such  as  two 
independent  braking  systems  in  an  automobile. 

Suppose  a  particular  species  of  trained  dogs  has  a  90%  chance  of  detecting 
contraband  in  airline  luggage.  If  the  luggage  is  checked  three  times  by  three 
different  dogs  independently  of  one  another,  what  is  the  probability  that 
contraband  will  be  detected? 

Solution: 

Let D1 denote the event that the contraband is detected by the first dog, D2
the event that it is detected by the second dog, and D3 the event that it is
detected by the third. Since each dog has a 90% chance of detecting the
contraband, by the Probability Rule for Complements it has a 10% chance of
failing. In symbols, P(D1^c) = 0.10, P(D2^c) = 0.10, and P(D3^c) = 0.10.

Let D denote the event that the contraband is detected. We seek P(D). It is
easier to find P(D^c), because although there are several ways for the
contraband to be detected, there is only one way for it to go undetected: all
three dogs must fail. Thus D^c = D1^c ∩ D2^c ∩ D3^c and

P(D) = 1 − P(D^c) = 1 − P(D1^c ∩ D2^c ∩ D3^c)

But the events D1, D2, and D3 are independent, which implies that their
complements are independent, so

P(D1^c ∩ D2^c ∩ D3^c) = P(D1^c) · P(D2^c) · P(D3^c) = 0.10 × 0.10 × 0.10 = 0.001

Using  this  number  in  the  previous  display  we  obtain 

P  (D)  =  1  -  0.001  =  0.999 

That  is,  although  any  one  dog  has  only  a  90%  chance  of  detecting  the 
contraband,  three  dogs  working  independently  have  a  99.9%  chance  of 
detecting  it. 
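
The complement argument in Example 27 generalizes to any number of independent
attempts that each succeed with the same probability. A minimal sketch, assuming
that setting; the function name at_least_one is a label chosen here, not a term
from the text:

```python
def at_least_one(p, n):
    """P(at least one success in n independent attempts, each succeeding with probability p).

    Computed as 1 minus the probability that all n attempts fail.
    """
    return 1 - (1 - p) ** n

# Example 27: three dogs, each with a 90% chance of detection.
print(round(at_least_one(0.90, 3), 4))   # 0.999

# Example 25(b): two laboratories, each with sensitivity 92%.
print(round(at_least_one(0.92, 2), 4))   # 0.9936

# Cross-check of Example 25(b) with the Additive Rule.
additive = 0.92 + 0.92 - 0.92 * 0.92
print(abs(additive - at_least_one(0.92, 2)) < 1e-12)   # True
```

Working through the complement needs a single product, whereas the Additive Rule
gains more correction terms with every event that is added.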
Probabilities on Tree Diagrams

Some  probability  problems  are  made  much  simpler  when  approached  using  a  tree 
diagram.  The  next  example  illustrates  how  to  place  probabilities  on  a  tree  diagram 
and  use  it  to  solve  a  problem. 
EXAMPLE 28


A jar contains 10 marbles, 7 black and 3 white. Two marbles are drawn
without replacement, which means that the first one is not put back before
the second one is drawn.

a.  What  is  the  probability  that  both  marbles  are  black? 

b.  What  is  the  probability  that  exactly  one  marble  is  black? 

c.  What  is  the  probability  that  at  least  one  marble  is  black? 


Solution: 


A tree diagram for the situation of drawing one marble after the other
without replacement is shown in Figure 3.6 "Tree Diagram for Drawing Two
Marbles". The circle and rectangle will be explained later, and should be
ignored for now.

Figure 3.6

Tree Diagram for Drawing Two Marbles

[Tree diagram not reproduced. Surviving label, bottom right node:
P(W1 ∩ W2) = (3/10) · (2/9) ≈ 0.07]


The numbers on the two leftmost branches are the probabilities of getting
either a black marble, 7 out of 10, or a white marble, 3 out of 10, on the first
draw. The number on each remaining branch is the probability of the event
corresponding to the node on the right end of the branch occurring, given
that the event corresponding to the node on the left end of the branch has
occurred. Thus for the top branch, connecting the two Bs, it is P(B2 | B1),
where B1 denotes the event “the first marble drawn is black” and B2 denotes
the event “the second marble drawn is black.” Since after drawing a black
marble out there are 9 marbles left, of which 6 are black, this probability is
6/9.
The number to the right of each final node is computed as shown, using the
principle that if the formula in the Rule for Conditional Probability is
multiplied by P(B), then the result is

P(B ∩ A) = P(B) · P(A | B)

a. The event “both marbles are black” is B1 ∩ B2 and corresponds to the
top right node in the tree, which has been circled. Thus as indicated
there, its probability is 0.47.

b.  The  event  “exactly  one  marble  is  black”  corresponds  to  the  two  nodes  of 
the  tree  enclosed  by  the  rectangle.  The  events  that  correspond  to  these 
two  nodes  are  mutually  exclusive:  black  followed  by  white  is 
incompatible  with  white  followed  by  black.  Thus  in  accordance  with  the 
Additive  Rule  for  Probability  we  merely  add  the  two  probabilities  next  to 
these  nodes,  since  what  would  be  subtracted  from  the  sum  is  zero.  Thus 
the  probability  of  drawing  exactly  one  black  marble  in  two  tries  is 

0.23  +  0.23  =  0.46. 

c.  The  event  “at  least  one  marble  is  black”  corresponds  to  the  three 
nodes  of  the  tree  enclosed  by  either  the  circle  or  the  rectangle. 

The  events  that  correspond  to  these  nodes  are  mutually 
exclusive,  so  as  in  part  (b)  we  merely  add  the  probabilities  next 
to  these  nodes.  Thus  the  probability  of  drawing  at  least  one  black 

marble  in  two  tries  is  0.47  +  0.23  +  0.23  =  0.93. 

Of course, this answer could have been found more easily using
the Probability Rule for Complements, simply subtracting the
probability of the complementary event, “two white marbles are
drawn,” from 1 to obtain 1 − 0.07 = 0.93.
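
The tree for this example can also be built in a few lines of Python using exact
fractions. This is a sketch under the example's assumptions (two draws, no
replacement); the function name draw_two and the two-letter node labels are
illustrative choices. Because the solution adds values already rounded to two
decimals, its 0.46 differs slightly from the exact value 7/15 ≈ 0.467 for
“exactly one black.”

```python
from fractions import Fraction

def draw_two(black, white):
    """Probabilities of the four final nodes for two draws without replacement."""
    total = black + white
    first = {"B": Fraction(black, total), "W": Fraction(white, total)}
    nodes = {}
    for c1, p1 in first.items():
        # Remove the marble just drawn before computing second-draw probabilities.
        b = black - (c1 == "B")
        w = white - (c1 == "W")
        second = {"B": Fraction(b, total - 1), "W": Fraction(w, total - 1)}
        for c2, p2 in second.items():
            nodes[c1 + c2] = p1 * p2   # multiply along the path to the node
    return nodes

nodes = draw_two(7, 3)
both_black = nodes["BB"]                   # 7/15, about 0.47
exactly_one = nodes["BW"] + nodes["WB"]    # 7/15, about 0.47
at_least_one = 1 - nodes["WW"]             # 14/15, about 0.93

print(both_black, exactly_one, at_least_one)  # 7/15 7/15 14/15
```

The four final-node probabilities add up to 1, which is a useful check on any
tree diagram.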


As  this  example  shows,  finding  the  probability  for  each  branch  is  fairly 
straightforward,  since  we  compute  it  knowing  everything  that  has  happened  in  the 
sequence  of  steps  so  far.  Two  principles  that  are  true  in  general  emerge  from  this 
example: 
Probabilities on Tree Diagrams

1. The probability of the event corresponding to any node on a tree is
the product of the numbers on the unique path of branches that
leads to that node from the start.

2. If an event corresponds to several final nodes, then its probability
is obtained by adding the numbers next to those nodes.


KEY  TAKEAWAYS 


•  A  conditional  probability  is  the  probability  that  an  event  has  occurred, 
taking  into  account  additional  information  about  the  result  of  the 
experiment. 

•  A  conditional  probability  can  always  be  computed  using  the  formula  in 
the  definition.  Sometimes  it  can  be  computed  by  discarding  part  of  the 
sample  space. 

• Two events A and B are independent if the probability P(A ∩ B) of
their intersection A ∩ B is equal to the product P(A) · P(B) of their
individual probabilities.
EXERCISES


1. For two events A and B, P(A) = 0.73, P(B) = 0.48, and
P(A ∩ B) = 0.29.

a. Find P(A | B).

b. Find P(B | A).

c. Determine whether or not A and B are independent.

2. For two events A and B, P(A) = 0.26, P(B) = 0.37, and
P(A ∩ B) = 0.11.

a. Find P(A | B).

b. Find P(B | A).

c. Determine whether or not A and B are independent.

3. For independent events A and B, P(A) = 0.81 and P(B) = 0.27.

a. Find P(A ∩ B).

b. Find P(A | B).

c. Find P(B | A).

4. For independent events A and B, P(A) = 0.68 and P(B) = 0.37.

a. Find P(A ∩ B).

b. Find P(A | B).

c. Find P(B | A).

5. For mutually exclusive events A and B, P(A) = 0.17 and P(B) = 0.32.

a. Find P(A | B).

b. Find P(B | A).

6. For mutually exclusive events A and B, P(A) = 0.45 and P(B) = 0.09.

a. Find P(A | B).

b. Find P(B | A).

7.  Compute  the  following  probabilities  in  connection  with  the  roll  of  a  single  fair 
die. 


a.  The  probability  that  the  roll  is  even. 

b.  The  probability  that  the  roll  is  even,  given  that  it  is  not  a  two. 

c.  The  probability  that  the  roll  is  even,  given  that  it  is  not  a  one. 
8. Compute the following probabilities in connection with two tosses of a fair
coin.

a.  The  probability  that  the  second  toss  is  heads. 

b.  The  probability  that  the  second  toss  is  heads,  given  that  the  first  toss  is 
heads. 

c.  The  probability  that  the  second  toss  is  heads,  given  that  at  least  one  of  the 
two  tosses  is  heads. 

9.  A  special  deck  of  16  cards  has  4  that  are  blue,  4  yellow,  4  green,  and  4  red.  The 
four  cards  of  each  color  are  numbered  from  one  to  four.  A  single  card  is  drawn 
at  random.  Find  the  following  probabilities. 

a.  The  probability  that  the  card  drawn  is  red. 

b.  The  probability  that  the  card  is  red,  given  that  it  is  not  green. 

c.  The  probability  that  the  card  is  red,  given  that  it  is  neither  red  nor  yellow. 

d.  The  probability  that  the  card  is  red,  given  that  it  is  not  a  four. 

10.  A  special  deck  of  16  cards  has  4  that  are  blue,  4  yellow,  4  green,  and  4  red.  The 
four  cards  of  each  color  are  numbered  from  one  to  four.  A  single  card  is  drawn 
at  random.  Find  the  following  probabilities. 

a.  The  probability  that  the  card  drawn  is  a  two  or  a  four. 

b.  The  probability  that  the  card  is  a  two  or  a  four,  given  that  it  is  not  a  one. 

c.  The  probability  that  the  card  is  a  two  or  a  four,  given  that  it  is  either  a  two 
or  a  three. 

d.  The  probability  that  the  card  is  a  two  or  a  four,  given  that  it  is  red  or 
green. 

11.  A  random  experiment  gave  rise  to  the  two-way  contingency  table  shown.  Use 
it  to  compute  the  probabilities  indicated. 


          R       S
   A    0.12    0.18
   B    0.28    0.42

a. $P(A)$, $P(R)$, $P(A \cap R)$.

b. Based on the answer to (a), determine whether or not the events A and R
are independent.

c. Based on the answer to (b), determine whether or not $P(A \mid R)$ can be
predicted without any computation. If so, make the prediction. In any case,
compute $P(A \mid R)$ using the Rule for Conditional Probability.

12.  A  random  experiment  gave  rise  to  the  two-way  contingency  table  shown.  Use 
it  to  compute  the  probabilities  indicated. 
          R       S
   A    0.13    0.07
   B    0.61    0.19

a. $P(A)$, $P(R)$, $P(A \cap R)$.

b. Based on the answer to (a), determine whether or not the events A and R
are independent.

c. Based on the answer to (b), determine whether or not $P(A \mid R)$ can be
predicted without any computation. If so, make the prediction. In any case,
compute $P(A \mid R)$ using the Rule for Conditional Probability.

13. Suppose for events A and B in a random experiment $P(A) = 0.70$ and
$P(B) = 0.30$. Compute the indicated probability, or explain why there is
not enough information to do so.

a. $P(A \cap B)$.

b. $P(A \cap B)$, with the extra information that A and B are independent.

c. $P(A \cap B)$, with the extra information that A and B are mutually
exclusive.

14. Suppose for events A and B connected to some random experiment,
$P(A) = 0.50$ and $P(B) = 0.50$. Compute the indicated probability, or
explain why there is not enough information to do so.

a. $P(A \cap B)$.

b. $P(A \cap B)$, with the extra information that A and B are independent.

c. $P(A \cap B)$, with the extra information that A and B are mutually
exclusive.

15. Suppose for events A, B, and C connected to some random experiment, A, B, and
C are independent and $P(A) = 0.88$, $P(B) = 0.65$, and
$P(C) = 0.44$. Compute the indicated probability, or explain why there is
not enough information to do so.

a. $P(A \cap B \cap C)$

b. $P(A^c \cap B^c \cap C^c)$

16. Suppose for events A, B, and C connected to some random experiment, A, B, and
C are independent and $P(A) = 0.95$, $P(B) = 0.73$, and
$P(C) = 0.62$. Compute the indicated probability, or explain why there is
not enough information to do so.

a. $P(A \cap B \cap C)$

b. $P(A^c \cap B^c \cap C^c)$
APPLICATIONS


17.  The  sample  space  that  describes  all  three-child  families  according  to  the 
genders  of  the  children  with  respect  to  birth  order  is 

S  =  {bbb,  bbg,  bgb,  bgg,  gbb,  gbg,  ggb,  ggg} 

In  the  experiment  of  selecting  a  three-child  family  at  random,  compute  each  of 
the  following  probabilities,  assuming  all  outcomes  are  equally  likely. 

a.  The  probability  that  the  family  has  at  least  two  boys. 

b.  The  probability  that  the  family  has  at  least  two  boys,  given  that  not  all  of 

the  children  are  girls. 

c.  The  probability  that  at  least  one  child  is  a  boy. 

d.  The  probability  that  at  least  one  child  is  a  boy,  given  that  the  first  born  is  a 

girl. 

18.  The  following  two-way  contingency  table  gives  the  breakdown  of  the 
population  in  a  particular  locale  according  to  age  and  number  of  vehicular 
moving  violations  in  the  past  three  years: 


A  person  is  selected  at  random.  Find  the  following  probabilities. 

a.  The  person  is  under  21. 

b.  The  person  has  had  at  least  two  violations  in  the  past  three  years. 

c.  The  person  has  had  at  least  two  violations  in  the  past  three  years,  given 
that  he  is  under  21. 

d.  The  person  is  under  21,  given  that  he  has  had  at  least  two  violations  in  the 
past  three  years. 

e.  Determine  whether  the  events  “the  person  is  under  21”  and  “the  person 
has  had  at  least  two  violations  in  the  past  three  years”  are  independent  or 
not. 

19. The following two-way contingency table gives the breakdown of the
population in a particular locale according to party affiliation (A, B, C, or None)
and opinion on a bond issue:

                            Opinion
Affiliation     Favors     Opposes     Undecided
A                0.12       0.09         0.07
B                0.16       0.12         0.14
C                0.04       0.03         0.06
None             0.08       0.06         0.03

A  person  is  selected  at  random.  Find  each  of  the  following  probabilities. 

a.  The  person  is  in  favor  of  the  bond  issue. 

b.  The  person  is  in  favor  of  the  bond  issue,  given  that  he  is  affiliated  with 
party  A. 

c.  The  person  is  in  favor  of  the  bond  issue,  given  that  he  is  affiliated  with 
party  B. 

20.  The  following  two-way  contingency  table  gives  the  breakdown  of  the 

population  of  patrons  at  a  grocery  store  according  to  the  number  of  items 
purchased  and  whether  or  not  the  patron  made  an  impulse  purchase  at  the 
checkout  counter: 


                      Impulse Purchase
Number of Items     Made      Not Made
Few                 0.01        0.19
Many                0.04        0.76

A  patron  is  selected  at  random.  Find  each  of  the  following  probabilities. 

a.  The  patron  made  an  impulse  purchase. 

b.  The  patron  made  an  impulse  purchase,  given  that  the  total  number  of 
items  purchased  was  many. 

c.  Determine  whether  or  not  the  events  “few  purchases”  and  “made  an 
impulse  purchase  at  the  checkout  counter”  are  independent. 

21.  The  following  two-way  contingency  table  gives  the  breakdown  of  the 

population  of  adults  in  a  particular  locale  according  to  employment  type  and 
level  of  life  insurance: 


                        Level of Insurance
Employment Type     Low      Medium     High
Unskilled           0.07      0.19      0.00
Semi-skilled        0.04      0.28      0.08
Skilled             0.03      0.18      0.05
Professional        0.01      0.05      0.02

An  adult  is  selected  at  random.  Find  each  of  the  following  probabilities. 

a.  The  person  has  a  high  level  of  life  insurance. 

b.  The  person  has  a  high  level  of  life  insurance,  given  that  he  does  not  have  a 
professional  position. 

c.  The  person  has  a  high  level  of  life  insurance,  given  that  he  has  a 
professional  position. 

d.  Determine  whether  or  not  the  events  “has  a  high  level  of  life  insurance” 
and  “has  a  professional  position”  are  independent. 

22.  The  sample  space  of  equally  likely  outcomes  for  the  experiment  of  rolling  two 
fair  dice  is 


11   12   13   14   15   16
21   22   23   24   25   26
31   32   33   34   35   36
41   42   43   44   45   46
51   52   53   54   55   56
61   62   63   64   65   66

Identify the events N: the sum is at least nine, T: at least one of the dice is a
two, and F: at least one of the dice is a five.

a. Find $P(N)$.

b. Find $P(N \mid F)$.

c. Find $P(N \mid T)$.

d.  Determine  from  the  previous  answers  whether  or  not  the  events  N  and  F 
are  independent;  whether  or  not  N  and  T  are. 

23.  The  sensitivity  of  a  drug  test  is  the  probability  that  the  test  will  be  positive 
when  administered  to  a  person  who  has  actually  taken  the  drug.  Suppose  that 
there  are  two  independent  tests  to  detect  the  presence  of  a  certain  type  of 
banned  drugs  in  athletes.  One  has  sensitivity  0.75;  the  other  has  sensitivity 
0.85.  If  both  are  applied  to  an  athlete  who  has  taken  this  type  of  drug,  what  is 
the  chance  that  his  usage  will  go  undetected? 
24. A man has two lights in his well house to keep the pipes from freezing in
winter. He checks the lights daily. Each light has probability 0.002 of burning
out before it is checked the next day (independently of the other light).

a. If the lights are wired in parallel, one will continue to shine even if the
other burns out. In this situation, compute the probability that at least one
light will continue to shine for the full 24 hours. Note the greatly increased
reliability of the system of two bulbs over that of a single bulb.

b. If the lights are wired in series, neither one will continue to shine even if
only one of them burns out. In this situation, compute the probability that
at least one light will continue to shine for the full 24 hours. Note the
slightly decreased reliability of the system of two bulbs over that of a
single bulb.

25. An accountant has observed that 5% of all copies of a particular two-part form
have an error in Part I, and 2% have an error in Part II. If the errors occur
independently, find the probability that a randomly selected form will be
error-free.

26.  A  box  contains  20  screws  which  are  identical  in  size,  but  12  of  which  are  zinc 
coated  and  8  of  which  are  not.  Two  screws  are  selected  at  random,  without 
replacement. 

a.  Find  the  probability  that  both  are  zinc  coated. 

b.  Find  the  probability  that  at  least  one  is  zinc  coated. 


ADDITIONAL  EXERCISES 


27. Events A and B are mutually exclusive. Find $P(A \mid B)$.

28.  The  city  council  of  a  particular  city  is  composed  of  five  members  of  party  A, 
four  members  of  party  B,  and  three  independents.  Two  council  members  are 
randomly  selected  to  form  an  investigative  committee. 

a.  Find  the  probability  that  both  are  from  party  A. 

b.  Find  the  probability  that  at  least  one  is  an  independent. 

c.  Find  the  probability  that  the  two  have  different  party  affiliations  (that  is, 
not  both  A,  not  both  B,  and  not  both  independent). 

29.  A  basketball  player  makes  60%  of  the  free  throws  that  he  attempts,  except  that 
if  he  has  just  tried  and  missed  a  free  throw  then  his  chances  of  making  a 
second  one  go  down  to  only  30%.  Suppose  he  has  just  been  awarded  two  free 
throws. 

a.  Find  the  probability  that  he  makes  both. 
b. Find the probability that he makes at least one. (A tree diagram could
help.)

30.  An  economist  wishes  to  ascertain  the  proportion  p  of  the  population  of 

individual  taxpayers  who  have  purposely  submitted  fraudulent  information  on 
an  income  tax  return.  To  truly  guarantee  anonymity  of  the  taxpayers  in  a 
random  survey,  taxpayers  questioned  are  given  the  following  instructions. 

1.  Flip  a  coin. 

2.  If  the  coin  lands  heads,  answer  “Yes”  to  the  question  “Have  you  ever 
submitted  fraudulent  information  on  a  tax  return?”  even  if  you  have  not. 

3.  If  the  coin  lands  tails,  give  a  truthful  “Yes”  or  “No”  answer  to  the  question 
“Have  you  ever  submitted  fraudulent  information  on  a  tax  return?” 

The  questioner  is  not  told  how  the  coin  landed,  so  he  does  not  know  if  a  “Yes” 
answer  is  the  truth  or  is  given  only  because  of  the  coin  toss. 

a.  Using  the  Probability  Rule  for  Complements  and  the  independence  of  the 
coin  toss  and  the  taxpayers’  status  fill  in  the  empty  cells  in  the  two-way 
contingency  table  shown.  Assume  that  the  coin  is  fair.  Each  cell  except  the 
two  in  the  bottom  row  will  contain  the  unknown  proportion  (or 
probability)  p. 


                      Coin
Status            H        T        Probability
Fraud            ___      ___           p
No fraud         ___      ___          ___
Probability      ___      ___           1

b.  The  only  information  that  the  economist  sees  are  the  entries  in  the 
following  table: 


Response        “Yes”     “No”
Proportion        r         s

Equate the entry in the one cell in the table in (a) that corresponds to the
answer “No” to the number s to obtain the formula $p = 1 - 2s$ that
expresses the unknown number p in terms of the known number s.

c. Equate the sum of the entries in the three cells in the table in (a) that
together correspond to the answer “Yes” to the number r to obtain the
formula $p = 2r - 1$ that expresses the unknown number p in terms of the
known number r.
d. Use the fact that $r + s = 1$ (since they are the probabilities of
complementary events) to verify that the formulas in (b) and (c) give the
same value for p. (For example, insert $s = 1 - r$ into the formula in (b) to
obtain the formula in (c).)

e. Suppose a survey of 1,200 taxpayers is conducted and 690 respond “Yes”
(truthfully or not) to the question “Have you ever submitted fraudulent
information  on  a  tax  return?”  Use  the  answer  to  either  (b)  or  (c)  to 
estimate  the  true  proportion  p  of  all  individual  taxpayers  who  have 
purposely  submitted  fraudulent  information  on  an  income  tax  return. 
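The survey scheme in Exercise 30 can be tried out on a computer. The sketch below is illustrative only: the function names `estimate_p` and `simulate_survey`, the survey counts, and the assumed true proportion 0.2 are all hypothetical. It applies the relation between p and the proportion r of “Yes” answers from part (c), and then simulates respondents who follow the coin-flip instructions to see whether the estimate lands near the proportion that was built into the simulation.

```python
import random

def estimate_p(yes_count, n):
    """Estimate p from the observed proportion r of "Yes" answers, using p = 2r - 1."""
    r = yes_count / n
    return 2 * r - 1

def simulate_survey(true_p, n, seed=0):
    """Simulate n respondents following the instructions; return the number of "Yes" answers."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        committed_fraud = rng.random() < true_p  # respondent's true status
        heads = rng.random() < 0.5               # fair coin, independent of status
        # Heads forces a "Yes"; tails gives the truthful answer.
        if heads or committed_fraud:
            yes += 1
    return yes

# Hypothetical survey: 600 "Yes" answers out of 1,000 respondents, so r = 0.6.
print(estimate_p(600, 1000))

# Simulated survey with an assumed true proportion of 0.2.
n = 200_000
print(estimate_p(simulate_survey(0.2, n, seed=1), n))
```

Because the questioner never learns how any one coin landed, an individual “Yes” reveals nothing about that taxpayer, yet the overall proportion of “Yes” answers is still enough to estimate p.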
ANSWERS

1. a. 0.6
   b. 0.4
   c. not independent

3. a. 0.22
   b. 0.81
   c. 0.27

5. a. 0
   b. 0

7. a. 0.5
   b. 0.4
   c. 0.6

9. a. 0.25
   b. 0.33
   c. 0
   d. 0.25

11. a. $P(A) = 0.3$, $P(R) = 0.4$, $P(A \cap R) = 0.12$
    b. independent
    c. without computation 0.3

13. a. Insufficient information. The events A and B are not known to be either
       independent or mutually exclusive.
    b. 0.21
    c. 0

15. a. 0.25
    b. 0.02

17. a. 0.5
    b. 0.57
    c. 0.875
    d. 0.75

19. a. 0.4
    b. 0.43
    c. 0.38

21. a. 0.15
    b. 0.14
    c. 0.25
    d. not independent

23. 0.0375

25. 0.931

27. 0

29. a. 0.36
    b. 0.72
It is often the case that a number is naturally associated to the outcome of a random
experiment:  the  number  of  boys  in  a  three-child  family,  the  number  of  defective 
light  bulbs  in  a  case  of  100  bulbs,  the  length  of  time  until  the  next  customer  arrives 
at  the  drive-through  window  at  a  bank.  Such  a  number  varies  from  trial  to  trial  of 
the  corresponding  experiment,  and  does  so  in  a  way  that  cannot  be  predicted  with 
certainty;  hence,  it  is  called  a  random  variable.  In  this  chapter  and  the  next  we  study 
such  variables. 
4.1 Random Variables


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  a  random  variable. 

2.  To  learn  the  distinction  between  discrete  and  continuous  random 
variables. 


Definition 

A  random  variable1  is  a  numerical  quantity  that  is  generated  by  a  random 
experiment. 


We  will  denote  random  variables  by  capital  letters,  such  as  X  or  Z,  and  the  actual 
values  that  they  can  take  by  lowercase  letters,  such  as  x  and  z. 


Table 4.1 "Four Random Variables" gives four examples of random variables. In the
second example, the three dots indicate that every counting number is a possible
value for X. Although it is highly unlikely, for example, that it would take 50 tosses
of the coin to observe heads for the first time, nevertheless it is conceivable, hence
the number 50 is a possible value. The set of possible values is infinite, but is still at
least countable, in the sense that all possible values can be listed one after another.
In the last two examples, by way of contrast, the possible values cannot be
individually listed, but take up a whole interval of numbers. In the fourth example,
since the light bulb could conceivably continue to shine indefinitely, there is no
natural greatest value for its lifetime, so we simply place the symbol $\infty$ for infinity
as the right endpoint of the interval of possible values.


Table 4.1 Four Random Variables

Experiment                                    Number X                                       Possible Values of X
Roll two fair dice                            Sum of the number of dots on the top faces     2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Flip a fair coin repeatedly                   Number of tosses until the coin lands heads    1, 2, 3, 4, ...
Measure the voltage at an electrical outlet   Voltage measured                               $118 \le x \le 122$
Operate a light bulb until it burns out       Time until the bulb burns out                  $0 \le x < \infty$

1. A numerical value generated
by a random experiment.

Definition 

A  random  variable  is  called  discrete2  if  it  has  either  a  finite  or  a  countable  number  of 
possible  values.  A  random  variable  is  called  continuous3  if  its  possible  values  contain 
a  whole  interval  of  numbers. 


The  examples  in  the  table  are  typical  in  that  discrete  random  variables  typically 
arise  from  a  counting  process,  whereas  continuous  random  variables  typically  arise 
from  a  measurement. 


KEY  TAKEAWAYS 


•  A  random  variable  is  a  number  generated  by  a  random  experiment. 

•  A  random  variable  is  called  discrete  if  its  possible  values  form  a  finite  or 
countable  set. 

•  A  random  variable  is  called  continuous  if  its  possible  values  contain  a 
whole  interval  of  numbers. 


2.  A  random  variable  with  a  finite 
or  countable  number  of 
possible  values. 

3.  A  random  variable  whose 
possible  values  contain  an 
interval  of  decimal  numbers. 
1. Classify each random variable as either discrete or continuous.

a. The number of arrivals at an emergency room between midnight and 6:00
a.m.

b.  The  weight  of  a  box  of  cereal  labeled  “18  ounces.” 

c.  The  duration  of  the  next  outgoing  telephone  call  from  a  business  office. 

d.  The  number  of  kernels  of  popcorn  in  a  1-pound  container. 

e.  The  number  of  applicants  for  a  job. 

2.  Classify  each  random  variable  as  either  discrete  or  continuous. 

a.  The  time  between  customers  entering  a  checkout  lane  at  a  retail  store. 

b.  The  weight  of  refuse  on  a  truck  arriving  at  a  landfill. 

c.  The  number  of  passengers  in  a  passenger  vehicle  on  a  highway  at  rush 
hour. 

d.  The  number  of  clerical  errors  on  a  medical  chart. 

e.  The  number  of  accident-free  days  in  one  month  at  a  factory. 

3.  Classify  each  random  variable  as  either  discrete  or  continuous. 

a.  The  number  of  boys  in  a  randomly  selected  three-child  family. 

b.  The  temperature  of  a  cup  of  coffee  served  at  a  restaurant. 

c.  The  number  of  no-shows  for  every  100  reservations  made  with  a 
commercial  airline. 

d.  The  number  of  vehicles  owned  by  a  randomly  selected  household. 

e.  The  average  amount  spent  on  electricity  each  July  by  a  randomly  selected 
household  in  a  certain  state. 

4.  Classify  each  random  variable  as  either  discrete  or  continuous. 

a.  The  number  of  patrons  arriving  at  a  restaurant  between  5:00  p.m.  and  6:00 
p.m. 

b.  The  number  of  new  cases  of  influenza  in  a  particular  county  in  a  coming 
month. 

c.  The  air  pressure  of  a  tire  on  an  automobile. 

d.  The  amount  of  rain  recorded  at  an  airport  one  day. 

e.  The  number  of  students  who  actually  register  for  classes  at  a  university 
next  semester. 

5.  Identify  the  set  of  possible  values  for  each  random  variable.  (Make  a 
reasonable  estimate  based  on  experience,  where  necessary.) 
a. The number of heads in two tosses of a coin.

b.  The  average  weight  of  newborn  babies  born  in  a  particular  county  one 
month. 

c.  The  amount  of  liquid  in  a  12-ounce  can  of  soft  drink. 

d.  The  number  of  games  in  the  next  World  Series  (best  of  up  to  seven  games). 

e.  The  number  of  coins  that  match  when  three  coins  are  tossed  at  once. 

6.  Identify  the  set  of  possible  values  for  each  random  variable.  (Make  a 

reasonable  estimate  based  on  experience,  where  necessary.) 

a.  The  number  of  hearts  in  a  five-card  hand  drawn  from  a  deck  of  52  cards 
that  contains  13  hearts  in  all. 

b.  The  number  of  pitches  made  by  a  starting  pitcher  in  a  major  league 
baseball  game. 

c.  The  number  of  breakdowns  of  city  buses  in  a  large  city  in  one  week. 

d.  The  distance  a  rental  car  rented  on  a  daily  rate  is  driven  each  day. 

e.  The  amount  of  rainfall  at  an  airport  next  month. 


ANSWERS

1. a. discrete
   b. continuous
   c. continuous
   d. discrete
   e. discrete

3. a. discrete
   b. continuous
   c. discrete
   d. discrete
   e. continuous

5. a. {0, 1, 2}
   b. an interval (a, b) (answers vary)
   c. an interval (a, b) (answers vary)
   d. {4, 5, 6, 7}
   e. {2, 3}


Chapter  4  Discrete  Random  Variables 


4.2  Probability  Distributions  for  Discrete  Random  Variables 


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  the  probability  distribution  of  a  discrete  random 
variable. 

2.  To  learn  the  concepts  of  the  mean,  variance,  and  standard  deviation  of  a 
discrete  random  variable,  and  how  to  compute  them. 


Probability  Distributions 

Associated  to  each  possible  value  x  of  a  discrete  random  variable  X  is  the  probability 
P  (x)  that  X  will  take  the  value  x  in  one  trial  of  the  experiment. 


Definition 

The probability distribution4 of a discrete random variable X is a list of each
possible value of X together with the probability that X takes that value in one trial of
the experiment.


The  probabilities  in  the  probability  distribution  of  a  random  variable  X  must  satisfy 
the  following  two  conditions: 


1. Each probability $P(x)$ must be between 0 and 1: $0 \le P(x) \le 1$.

2. The sum of all the probabilities is 1: $\sum P(x) = 1$.
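The two conditions are easy to test mechanically. Here is a minimal sketch, assuming a distribution is stored as a Python dictionary that maps each possible value x to P(x); the function name `is_valid_distribution` and the second, deliberately broken table are illustrative.

```python
def is_valid_distribution(dist, tol=1e-9):
    """Return True if `dist` (a dict mapping each value x to P(x)) meets both conditions."""
    # Condition 1: every probability lies between 0 and 1, inclusive.
    each_in_range = all(0 <= p <= 1 for p in dist.values())
    # Condition 2: the probabilities add up to 1 (allowing for rounding error).
    sums_to_one = abs(sum(dist.values()) - 1) < tol
    return each_in_range and sums_to_one

# Number of heads in two tosses of a fair coin.
heads_in_two_tosses = {0: 0.25, 1: 0.50, 2: 0.25}

# Hypothetical table that fails condition 2: the probabilities add up to 1.1.
broken_table = {0: 0.5, 1: 0.6}

print(is_valid_distribution(heads_in_two_tosses))
print(is_valid_distribution(broken_table))
```

The small tolerance `tol` is there because decimal probabilities are stored only approximately in floating-point arithmetic, so a sum that is exactly 1 on paper may differ from 1 by a tiny amount on a computer.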


4.  A  list  of  each  possible  value 
and  its  probability. 
EXAMPLE 1


A  fair  coin  is  tossed  twice.  Let  X  be  the  number  of  heads  that  are  observed. 

a.  Construct  the  probability  distribution  of  X. 

b.  Find  the  probability  that  at  least  one  head  is  observed. 

Solution: 


a. The possible values that X can take are 0, 1, and 2. Each of these
numbers corresponds to an event in the sample space
S = {hh, ht, th, tt} of equally likely outcomes for this
experiment: X = 0 to {tt}, X = 1 to {ht, th}, and X = 2 to {hh}.
The probability of each of these events, hence of the
corresponding value of X, can be found simply by counting, to
give

$x$        0       1       2
$P(x)$    0.25    0.50    0.25

This table is the probability distribution of X.

b. “At least one head” is the event $X \ge 1$, which is the union of the
mutually exclusive events X = 1 and X = 2. Thus

$P(X \ge 1) = P(1) + P(2) = 0.50 + 0.25 = 0.75$

A  histogram  that  graphically  illustrates  the  probability 
distribution  is  given  in  Figure  4.1  "Probability  Distribution  for 
Tossing  a  Fair  Coin  Twice". 
Figure 4.1

Probability  Distribution  for  Tossing  a  Fair  Coin  Twice 
EXAMPLE 2


A  pair  of  fair  dice  is  rolled.  Let  X  denote  the  sum  of  the  number  of  dots  on 
the  top  faces. 

a.  Construct  the  probability  distribution  of  X. 

b. Find $P(X \ge 9)$.

c.  Find  the  probability  that  X  takes  an  even  value. 

Solution: 

The  sample  space  of  equally  likely  outcomes  is 


11   12   13   14   15   16
21   22   23   24   25   26
31   32   33   34   35   36
41   42   43   44   45   46
51   52   53   54   55   56
61   62   63   64   65   66

a. The possible values for X are the numbers 2 through 12. X = 2 is
the event {11}, so P(2) = 1/36. X = 3 is the event {12, 21}, so
P(3) = 2/36. Continuing this way we obtain the table

$x$        2      3      4      5      6      7      8      9      10     11     12
$P(x)$    1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36

This  table  is  the  probability  distribution  of  X. 

b. The event $X \ge 9$ is the union of the mutually exclusive events X =
9, X = 10, X = 11, and X = 12. Thus

$P(X \ge 9) = P(9) + P(10) + P(11) + P(12) = \frac{4}{36} + \frac{3}{36} + \frac{2}{36} + \frac{1}{36} = \frac{10}{36} \approx 0.278$

c. Before we immediately jump to the conclusion that the
probability that X takes an even value must be 0.5, note that X
takes six different even values but only five different odd values.
We compute

$P(X \text{ is even}) = P(2) + P(4) + P(6) + P(8) + P(10) + P(12)$

$= \frac{1}{36} + \frac{3}{36} + \frac{5}{36} + \frac{5}{36} + \frac{3}{36} + \frac{1}{36} = \frac{18}{36} = 0.5$


A  histogram  that  graphically  illustrates  the  probability 
distribution  is  given  in  Figure  4.2  "Probability  Distribution  for 
Tossing  Two  Fair  Dice". 


Figure  4.2 

Probability  Distribution  for  Tossing  Two  Fair  Dice 
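The counting argument of Example 2 can be reproduced by listing all 36 equally likely outcomes and tallying the sums. The following is a minimal sketch; the variable names are illustrative, and exact fractions are used so that the results can be compared directly with the fractions over 36 in the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes produce each sum, then divide by 36.
counts = Counter(first + second for first, second in outcomes)
dist = {x: Fraction(count, len(outcomes)) for x, count in sorted(counts.items())}

# Probabilities of events are sums of P(x) over the values x in the event.
p_at_least_9 = sum(p for x, p in dist.items() if x >= 9)
p_even = sum(p for x, p in dist.items() if x % 2 == 0)

for x, p in dist.items():
    print(x, p)
print(p_at_least_9, p_even)
```

Note that `Fraction` reduces automatically, so 10/36 is reported in lowest terms.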
The Mean and Standard Deviation of a Discrete Random Variable


Definition 

The  mean5  (also  called  the  expected  value6)  of  a  discrete  random  variable  X  is  the 
number 

$\mu = E(X) = \sum x P(x)$

The  mean  of  a  random  variable  may  be  interpreted  as  the  average  of  the  values 
assumed  by  the  random  variable  in  repeated  trials  of  the  experiment. 


EXAMPLE  3 


Find  the  mean  of  the  discrete  random  variable  X  whose  probability 
distribution  is 


$x$        -2      1       2       3.5
$P(x)$    0.21    0.34    0.24    0.21

Solution:

The formula in the definition gives

$\mu = \sum x P(x) = (-2)(0.21) + (1)(0.34) + (2)(0.24) + (3.5)(0.21) = 1.135$
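The interpretation of the mean as a long-run average can be illustrated with the distribution of Example 3. The sketch below computes the mean from the formula in the definition and then compares it with the average of a large number of simulated trials; the number of trials and the random seed are arbitrary choices made for illustration.

```python
import random

# Probability distribution from Example 3.
values = [-2, 1, 2, 3.5]
probs = [0.21, 0.34, 0.24, 0.21]

# Mean from the definition: add up x * P(x) over all possible values.
mean = sum(x * p for x, p in zip(values, probs))

# Simulate repeated trials and average the observed values.
rng = random.Random(0)
n_trials = 100_000
draws = rng.choices(values, weights=probs, k=n_trials)
sample_average = sum(draws) / n_trials

print(mean, sample_average)
```

The simulated average will not equal the mean exactly, but with many trials it should be close, and it tends to get closer as the number of trials grows.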


5. The number $\sum x P(x)$,
measuring its average upon
repeated trials.

6.  Its  mean. 
EXAMPLE 4


A  service  organization  in  a  large  town  organizes  a  raffle  each  month.  One 
thousand  raffle  tickets  are  sold  for  $1  each.  Each  has  an  equal  chance  of 
winning.  First  prize  is  $300,  second  prize  is  $200,  and  third  prize  is  $100.  Let 
X  denote  the  net  gain  from  the  purchase  of  one  ticket. 

a.  Construct  the  probability  distribution  of  X. 

b.  Find  the  probability  of  winning  any  money  in  the  purchase  of  one  ticket. 

c.  Find  the  expected  value  of  X,  and  interpret  its  meaning. 

Solution: 


a. If a ticket is selected as the first prize winner, the net gain to the
purchaser is the $300 prize less the $1 that was paid for the
ticket, hence X = 300 - 1 = 299. There is one such ticket, so P(299)
= 0.001. Applying the same “income minus outgo” principle to
the second and third prize winners and to the 997 losing tickets
yields the probability distribution:

x        299      199      99       -1
P(x)    0.001    0.001    0.001    0.997

b. Let W denote the event that a ticket is selected to win one of the
prizes. Using the table,

P(W) = P(299) + P(199) + P(99) = 0.001 + 0.001 + 0.001 = 0.003

c. Using the formula in the definition of expected value,

E(X) = 299 · 0.001 + 199 · 0.001 + 99 · 0.001 + (-1) · 0.997 = -0.4

The negative value means that one loses money on the average.
In particular, if someone were to buy tickets repeatedly, then
although he would win now and then, on average he would lose
40 cents per ticket purchased.


The  concept  of  expected  value  is  also  basic  to  the  insurance  industry,  as  the 
following  simplified  example  illustrates. 
EXAMPLE 5


A  life  insurance  company  will  sell  a  $200,000  one-year  term  life  insurance 
policy  to  an  individual  in  a  particular  risk  group  for  a  premium  of  $195.  Find 
the  expected  value  to  the  company  of  a  single  policy  if  a  person  in  this  risk 
group  has  a  99.97%  chance  of  surviving  one  year. 

Solution: 

Let  X  denote  the  net  gain  to  the  company  from  the  sale  of  one  such  policy. 
There  are  two  possibilities:  the  insured  person  lives  the  whole  year  or  the 
insured  person  dies  before  the  year  is  up.  Applying  the  “income  minus 
outgo” principle, in the former case the value of X is 195 - 0 = 195; in the latter
case it is 195 - 200,000 = -199,805. Since the probability in the
first case is 0.9997 and in the second case is 1 - 0.9997 = 0.0003, the
probability distribution for X is:


x      195     -199,805

P(x)   0.9997  0.0003

Therefore 

E(X) = Σ x P(x) = 195 • 0.9997 + (-199,805) • 0.0003 = 135

Occasionally  (in  fact,  3  times  in  10,000)  the  company  loses  a  large  amount  of 
money  on  a  policy,  but  typically  it  gains  $195,  which  by  our  computation  of 
E  (X)  works  out  to  a  net  gain  of  $135  per  policy  sold,  on  average. 
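
The same “income minus outgo” computation can be sketched as a small function. The function name and its parameters are hypothetical, introduced only for illustration; the numbers are those of Example 5.

```python
def expected_gain(premium, payout, p_survive):
    """Expected net gain to the insurer from one one-year term policy.

    If the insured survives, the company keeps the premium; otherwise it
    keeps the premium but pays out the face value of the policy.
    """
    gain_if_survives = premium
    gain_if_dies = premium - payout
    return gain_if_survives * p_survive + gain_if_dies * (1 - p_survive)


example_5 = expected_gain(premium=195, payout=200_000, p_survive=0.9997)
print(f"E(X) = {example_5:.2f} dollars per policy")
```

Changing the three arguments gives the corresponding computation for other policies of the same form.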
Definition


The variance, σ², of a discrete random variable X is the number

σ² = Σ (x - μ)² P(x)

which by algebra is equivalent to the formula

σ² = [Σ x² P(x)] - μ²


Definition


The standard deviation7, σ, of a discrete random variable X is the square root of its
variance, hence is given by the formulas

σ = √( Σ (x - μ)² P(x) ) = √( [Σ x² P(x)] - μ² )

The  variance  and  standard  deviation  of  a  discrete  random  variable  X  may  be 
interpreted  as  measures  of  the  variability  of  the  values  assumed  by  the  random 
variable  in  repeated  trials  of  the  experiment.  The  units  on  the  standard  deviation 
match  those  of  X. 
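
As a sketch of how the two variance formulas agree in practice, the following applies both to the insurer's net gain from Example 5. The function names are illustrative assumptions, not notation from the text.

```python
from math import sqrt

# Net gain X (in dollars) to the insurer in Example 5.
distribution = {195: 0.9997, -199_805: 0.0003}


def mean(dist):
    """Mean of X: the sum of x * P(x)."""
    return sum(x * p for x, p in dist.items())


def variance_by_definition(dist):
    """Variance as the sum of (x - mean)^2 * P(x)."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


def variance_by_shortcut(dist):
    """Variance as [sum of x^2 * P(x)] - mean^2."""
    mu = mean(dist)
    return sum(x ** 2 * p for x, p in dist.items()) - mu ** 2


var_def = variance_by_definition(distribution)
var_short = variance_by_shortcut(distribution)
sigma = sqrt(var_def)  # same units as X, here dollars

print(f"variance (definition) = {var_def:.2f}")
print(f"variance (shortcut)   = {var_short:.2f}")
print(f"standard deviation    = {sigma:.2f} dollars")
```

The standard deviation is far larger than the mean of $135, which reflects the rare but very large loss on a single policy.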


7. The number √( Σ (x - μ)² P(x) ) (also computed using √( [Σ x² P(x)] - μ² )),
measuring its variability under repeated trials.
EXAMPLE 6


A discrete random variable X has the following probability distribution:


x      -1   0    1    4

P(x)   0.2  0.5  a    0.1

A  histogram  that  graphically  illustrates  the  probability  distribution  is  given 
in  Figure  4.3  "Probability  Distribution  of  a  Discrete  Random  Variable". 

Figure  4.3 

Probability  Distribution  of  a  Discrete  Random  Variable 


Compute  each  of  the  following  quantities. 


a. a.

b. P(0).

c. P(X > 0).

d. P(X ≥ 0).

e. P(X ≤ -2).

f. The mean μ of X.

g. The variance σ² of X.

h. The standard deviation σ of X.
Solution:

a.  Since  all  probabilities  must  add  up  to  1, 

a  =  1  -  (0.2 +  0.5  +  0.1)  =  0.2. 

b.  Directly  from  the  table,  P  (0)  =  0.5. 

c.  From  the  table,  P  (X  >  0)  =  P  (1)  +  P  (4)  =  0.2  +  0.1  =  0.3. 

d. From the table,

P(X ≥ 0) = P(0) + P(1) + P(4) = 0.5 + 0.2 + 0.1 = 0.8.

e. Since none of the numbers listed as possible values for X is less than or
equal to -2, the event X ≤ -2 is impossible, so P(X ≤ -2) = 0.

f. Using the formula in the definition of μ,

μ = Σ x P(x) = (-1) • 0.2 + 0 • 0.5 + 1 • 0.2 + 4 • 0.1 = 0.4

g. Using the formula in the definition of σ² and the value of μ that
was just computed,

σ² = Σ (x - μ)² P(x)

= (-1 - 0.4)² • 0.2 + (0 - 0.4)² • 0.5 + (1 - 0.4)² • 0.2 + (4 - 0.4)² • 0.1
= 1.84

h. Using the result of part (g), σ = √1.84 ≈ 1.3565.
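
The eight parts of this example can be reproduced with a few lines of code. This is a sketch under the assumption that the distribution is stored as a dictionary; the variable names are illustrative only.

```python
from math import sqrt

# Known entries of the table in Example 6; P(1) = a is missing.
known = {-1: 0.2, 0: 0.5, 4: 0.1}

# (a) All probabilities must add up to 1.
a = 1 - sum(known.values())
dist = {**known, 1: a}

# (b)-(e) Probabilities of events, read off by summing table entries.
p_0 = dist[0]
p_gt_0 = sum(p for x, p in dist.items() if x > 0)
p_ge_0 = sum(p for x, p in dist.items() if x >= 0)
p_le_minus_2 = sum(p for x, p in dist.items() if x <= -2)  # no such x

# (f)-(h) Mean, variance, and standard deviation from the definitions.
mu = sum(x * p for x, p in dist.items())
variance = sum((x - mu) ** 2 * p for x, p in dist.items())
sigma = sqrt(variance)

print(round(a, 4), round(mu, 4), round(variance, 4), round(sigma, 4))
```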


KEY  TAKEAWAYS 


•  The  probability  distribution  of  a  discrete  random  variable  X  is  a  listing 
of  each  possible  value  x  taken  by  X  along  with  the  probability  P  (x)  that 
X  takes  that  value  in  one  trial  of  the  experiment. 

• The mean μ of a discrete random variable X is a number that indicates
the average value of X over numerous trials of the experiment. It is
computed using the formula μ = Σ x P(x).

• The variance σ² and standard deviation σ of a discrete random variable
X are numbers that indicate the variability of X over numerous trials of
the experiment. They may be computed using the formula
σ² = [Σ x² P(x)] - μ², taking the square root to obtain σ.

EXERCISES


BASIC 


1.  Determine  whether  or  not  the  table  is  a  valid  probability  distribution  of  a 
discrete  random  variable.  Explain  fully. 


a.

x      -2   0    2    4

P(x)   0.3  0.5  0.2  0.1

b.

x      0.5   0.25  0.25

P(x)   -0.4  0.6   0.8

c.

x      1.1   2.5   4.1   4.6   5.3

P(x)   0.16  0.14  0.11  0.27  0.22

2.  Determine  whether  or  not  the  table  is  a  valid  probability  distribution  of  a 
discrete  random  variable.  Explain  fully. 


a.

x      0      1     2     3     4

P(x)   -0.25  0.50  0.35  0.10  0.30

b.

x      1      2      3

P(x)   0.325  0.406  0.164

c.

x      25    26    27    28    29

P(x)   0.13  0.27  0.28  0.18  0.14

3. A discrete random variable X has the following probability distribution:


x      77    78    79    80    81

P(x)   0.15  0.15  0.20  0.40  0.10

Compute each of the following quantities.

a. P(80).

b. P(X > 80).

c. P(X ≤ 80).

d. The mean μ of X.

e. The variance σ² of X.

f. The standard deviation σ of X.

4.  A  discrete  random  variable  X  has  the  following  probability  distribution: 


x      13    18    20    24    27

P(x)   0.22  0.25  0.20  0.17  0.16

Compute each of the following quantities.

a. P(18).

b. P(X > 18).

c. P(X ≤ 18).

d. The mean μ of X.

e. The variance σ² of X.

f. The standard deviation σ of X.


5.  If  each  die  in  a  pair  is  “loaded”  so  that  one  comes  up  half  as  often  as  it  should, 
six  comes  up  half  again  as  often  as  it  should,  and  the  probabilities  of  the  other 
faces  are  unaltered,  then  the  probability  distribution  for  the  sum  X  of  the 
number  of  dots  on  the  top  faces  when  the  two  are  rolled  is 


x      2      3      4      5       6       7       8       9       10      11      12

P(x)   1/144  4/144  8/144  12/144  16/144  22/144  24/144  20/144  16/144  12/144  9/144


Compute each of the following.


a. P(5 ≤ X ≤ 9).

b. P(X ≥ 7).

c. The mean μ of X. (For fair dice this number is 7.)

d. The standard deviation σ of X. (For fair dice this number is about 2.415.)


APPLICATIONS 


6.  Borachio  works  in  an  automotive  tire  factory.  The  number  X  of  sound  but 
blemished  tires  that  he  produces  on  a  random  day  has  the  probability 
distribution 


x      2     3     4     5

P(x)   0.48  0.36  0.12  0.04

a.  Find  the  probability  that  Borachio  will  produce  more  than  three  blemished 
tires  tomorrow. 

b.  Find  the  probability  that  Borachio  will  produce  at  most  two  blemished 
tires  tomorrow. 

c.  Compute  the  mean  and  standard  deviation  of  X.  Interpret  the  mean  in  the 
context  of  the  problem. 

7.  In  a  hamster  breeder's  experience  the  number  X  of  live  pups  in  a  litter  of  a 
female  not  over  twelve  months  in  age  who  has  not  borne  a  litter  in  the  past  six 
weeks  has  the  probability  distribution 


x      3     4     5     6     7     8     9

P(x)   0.04  0.10  0.26  0.31  0.22  0.05  0.02

a.  Find  the  probability  that  the  next  litter  will  produce  five  to  seven  live 
pups. 

b.  Find  the  probability  that  the  next  litter  will  produce  at  least  six  live  pups. 

c.  Compute  the  mean  and  standard  deviation  of  X.  Interpret  the  mean  in  the 
context  of  the  problem. 


8.  The  number  X  of  days  in  the  summer  months  that  a  construction  crew  cannot 
work  because  of  the  weather  has  the  probability  distribution 


x      6     7     8     9     10    11    12    13    14

P(x)   0.03  0.08  0.15  0.20  0.19  0.16  0.10  0.07  0.02


a. Find the probability that no more than ten days will be lost next summer.

b. Find the probability that from 8 to 12 days will be lost next summer.

c. Find the probability that no days at all will be lost next summer.

d.  Compute  the  mean  and  standard  deviation  of  X.  Interpret  the  mean  in  the 
context  of  the  problem. 

9.  Let  X  denote  the  number  of  boys  in  a  randomly  selected  three-child  family. 
Assuming  that  boys  and  girls  are  equally  likely,  construct  the  probability 
distribution  of  X. 

10.  Let  X  denote  the  number  of  times  a  fair  coin  lands  heads  in  three  tosses. 
Construct  the  probability  distribution  of  X. 

11.  Five  thousand  lottery  tickets  are  sold  for  $1  each.  One  ticket  will  win  $1,000, 
two  tickets  will  win  $500  each,  and  ten  tickets  will  win  $100  each.  Let  X  denote 
the  net  gain  from  the  purchase  of  a  randomly  selected  ticket. 

a.  Construct  the  probability  distribution  of  X. 

b.  Compute  the  expected  value  E  (X)  of  X.  Interpret  its  meaning. 

c. Compute the standard deviation σ of X.

12.  Seven  thousand  lottery  tickets  are  sold  for  $5  each.  One  ticket  will  win  $2,000, 
two  tickets  will  win  $750  each,  and  five  tickets  will  win  $100  each.  Let  X  denote 
the  net  gain  from  the  purchase  of  a  randomly  selected  ticket. 

a.  Construct  the  probability  distribution  of  X. 

b.  Compute  the  expected  value  E  (X)  of  X.  Interpret  its  meaning. 

c. Compute the standard deviation σ of X.

13.  An  insurance  company  will  sell  a  $90,000  one-year  term  life  insurance  policy  to 
an  individual  in  a  particular  risk  group  for  a  premium  of  $478.  Find  the 
expected  value  to  the  company  of  a  single  policy  if  a  person  in  this  risk  group 
has  a  99.62%  chance  of  surviving  one  year. 

14.  An  insurance  company  will  sell  a  $10,000  one-year  term  life  insurance  policy  to 
an  individual  in  a  particular  risk  group  for  a  premium  of  $368.  Find  the 
expected  value  to  the  company  of  a  single  policy  if  a  person  in  this  risk  group 
has  a  97.25%  chance  of  surviving  one  year. 

15.  An  insurance  company  estimates  that  the  probability  that  an  individual  in  a 
particular  risk  group  will  survive  one  year  is  0.9825.  Such  a  person  wishes  to 
buy  a  $150,000  one-year  term  life  insurance  policy.  Let  C  denote  how  much  the 
insurance  company  charges  such  a  person  for  such  a  policy. 

a.  Construct  the  probability  distribution  of  X.  (Two  entries  in  the  table  will 
contain  C.) 

b. Compute the expected value E(X) of X.

c. Determine the value C must have in order for the company to break even
on  all  such  policies  (that  is,  to  average  a  net  gain  of  zero  per  policy  on  such 
policies). 

d.  Determine  the  value  C  must  have  in  order  for  the  company  to  average  a  net 
gain  of  $250  per  policy  on  all  such  policies. 

16.  An  insurance  company  estimates  that  the  probability  that  an  individual  in  a 
particular  risk  group  will  survive  one  year  is  0.99.  Such  a  person  wishes  to  buy 
a  $75,000  one-year  term  life  insurance  policy.  Let  C  denote  how  much  the 
insurance  company  charges  such  a  person  for  such  a  policy. 

a.  Construct  the  probability  distribution  of  X.  (Two  entries  in  the  table  will 
contain  C.) 

b.  Compute  the  expected  value  E  (X)  of  X. 

c.  Determine  the  value  C  must  have  in  order  for  the  company  to  break  even 
on  all  such  policies  (that  is,  to  average  a  net  gain  of  zero  per  policy  on  such 
policies). 

d.  Determine  the  value  C  must  have  in  order  for  the  company  to  average  a  net 
gain  of  $150  per  policy  on  all  such  policies. 

17.  A  roulette  wheel  has  38  slots.  Thirty-six  slots  are  numbered  from  1  to  36;  half 
of  them  are  red  and  half  are  black.  The  remaining  two  slots  are  numbered  0 
and 00 and are green. In a $1 bet on red, the bettor pays $1 to play. If the ball
lands in a red slot, he receives back the dollar he bet plus an additional dollar.
If the ball does not land on red he loses his dollar. Let X denote the net gain to
the  bettor  on  one  play  of  the  game. 

a.  Construct  the  probability  distribution  of  X. 

b.  Compute  the  expected  value  E  ( X )  of  X,  and  interpret  its  meaning  in  the 
context  of  the  problem. 

c.  Compute  the  standard  deviation  of  X. 

18.  A  roulette  wheel  has  38  slots.  Thirty-six  slots  are  numbered  from  1  to  36;  the 
remaining  two  slots  are  numbered  0  and  00.  Suppose  the  “number”  00  is 
considered not to be even, but the number 0 is still even. In a $1 bet on even,
the bettor pays $1 to play. If the ball lands in an even numbered slot, he
receives back the dollar he bet plus an additional dollar. If the ball does not
land on an even numbered slot, he loses his dollar. Let X denote the net gain to
the  bettor  on  one  play  of  the  game. 

a.  Construct  the  probability  distribution  of  X. 

b.  Compute  the  expected  value  E  (X)  of  X,  and  explain  why  this  game  is  not 
offered  in  a  casino  (where  0  is  not  considered  even). 

c. Compute the standard deviation of X.

19. The time, to the nearest whole minute, that a city bus takes to go from one end
of its route to the other has the probability distribution shown. As sometimes
happens with probabilities computed as empirical relative frequencies,
probabilities in the table add up to a value other than 1.00 because of
round-off error.


x      42    43    44    45    46    47

P(x)   0.10  0.23  0.34  0.25  0.05  0.02

a.  Find  the  average  time  the  bus  takes  to  drive  the  length  of  its  route. 

b.  Find  the  standard  deviation  of  the  length  of  time  the  bus  takes  to  drive  the 
length  of  its  route. 

20.  Tybalt  receives  in  the  mail  an  offer  to  enter  a  national  sweepstakes.  The  prizes 
and  chances  of  winning  are  listed  in  the  offer  as:  $5  million,  one  chance  in  65 
million;  $150,000,  one  chance  in  6.5  million;  $5,000,  one  chance  in  650,000;  and 
$1,000, one chance in 65,000. If it costs Tybalt 44 cents to mail his entry, what is
the  expected  value  of  the  sweepstakes  to  him? 


ADDITIONAL  EXERCISES 


21.  The  number  X  of  nails  in  a  randomly  selected  1-pound  box  has  the  probability 
distribution  shown.  Find  the  average  number  of  nails  per  pound. 


x      100   101   102

P(x)   0.01  0.96  0.03

22.  Three  fair  dice  are  rolled  at  once.  Let  X  denote  the  number  of  dice  that  land 
with  the  same  number  of  dots  on  top  as  at  least  one  other  die.  The  probability 
distribution  for  X  is 


x      0   u      3

P(x)   p   15/36  1/36

a.  Find  the  missing  value  u  of  X. 

b.  Find  the  missing  probability  p. 

c.  Compute  the  mean  of  X. 

d.  Compute  the  standard  deviation  of  X. 

23.  Two  fair  dice  are  rolled  at  once.  Let  X  denote  the  difference  in  the  number  of 
dots  that  appear  on  the  top  faces  of  the  two  dice.  Thus  for  example  if  a  one  and 
a five are rolled, X = 4, and if two sixes are rolled, X = 0.

a. Construct the probability distribution for X.

b. Compute the mean μ of X.

c. Compute the standard deviation σ of X.

24.  A  fair  coin  is  tossed  repeatedly  until  either  it  lands  heads  or  a  total  of  five 
tosses  have  been  made,  whichever  comes  first.  Let  X  denote  the  number  of 
tosses  made. 

a.  Construct  the  probability  distribution  for  X. 

b. Compute the mean μ of X.

c. Compute the standard deviation σ of X.

25.  A  manufacturer  receives  a  certain  component  from  a  supplier  in  shipments  of 
100 units. Two units in each shipment are selected at random and tested. If
either one of the units is defective the shipment is rejected. Suppose a
shipment  has  5  defective  units. 

a.  Construct  the  probability  distribution  for  the  number  X  of  defective  units 
in  such  a  sample.  (A  tree  diagram  is  helpful.) 

b.  Find  the  probability  that  such  a  shipment  will  be  accepted. 

26.  Shylock  enters  a  local  branch  bank  at  4:30  p.m.  every  payday,  at  which  time 
there  are  always  two  tellers  on  duty.  The  number  X  of  customers  in  the  bank 
who  are  either  at  a  teller  window  or  are  waiting  in  a  single  line  for  the  next 
available  teller  has  the  following  probability  distribution. 

x      0      1      2      3      4      5      6

P(x)   0.135  0.192  0.284  0.230  0.103  0.051  0.005

a.  What  number  of  customers  does  Shylock  most  often  see  in  the  bank  the 
moment  he  enters? 

b.  What  number  of  customers  waiting  in  line  does  Shylock  most  often  see  the 
moment  he  enters? 

c.  What  is  the  average  number  of  customers  who  are  waiting  in  line  the 
moment  Shylock  enters? 

27.  The  owner  of  a  proposed  outdoor  theater  must  decide  whether  to  include  a 

cover  that  will  allow  shows  to  be  performed  in  all  weather  conditions.  Based  on 
projected  audience  sizes  and  weather  conditions,  the  probability  distribution 
for the revenue X per night if the cover is not installed is

Weather               x      P(x)

Clear                 $3000  0.61

Threatening           $2800  0.17

Light rain            $1975  0.11

Show-cancelling rain  $0     0.11

The  additional  cost  of  the  cover  is  $410,000.  The  owner  will  have  it  built  if  this 
cost  can  be  recovered  from  the  increased  revenue  the  cover  affords  in  the  first 
ten  90-night  seasons. 


a.  Compute  the  mean  revenue  per  night  if  the  cover  is  not  installed. 

b.  Use  the  answer  to  (a)  to  compute  the  projected  total  revenue  per  90-night 
season  if  the  cover  is  not  installed. 

c.  Compute  the  projected  total  revenue  per  season  when  the  cover  is  in  place. 
To  do  so  assume  that  if  the  cover  were  in  place  the  revenue  each  night  of 
the  season  would  be  the  same  as  the  revenue  on  a  clear  night. 

d.  Using  the  answers  to  (b)  and  (c),  decide  whether  or  not  the  additional  cost 
of  the  installation  of  the  cover  will  be  recovered  from  the  increased 
revenue over the first ten years. Will the owner have the cover installed?

ANSWERS


1.  a.  no:  the  sum  of  the  probabilities  exceeds  1 

b.  no:  a  negative  probability 

c.  no:  the  sum  of  the  probabilities  is  less  than  1 

3.  a.  0.4 

b.  0.1 

c.  0.9 

d.  79.15 

e. σ² = 1.5275

f. σ = 1.2359

5. a. 0.6528

b. 0.7153

c. μ = 7.8333

d. σ² = 5.4866

e. σ = 2.3424

7. a. 0.79

b. 0.60

c. μ = 5.8, σ = 1.2570


9.

x      0    1    2    3

P(x)   1/8  3/8  3/8  1/8

11. a.

x      -1         999     499     99

P(x)   4987/5000  1/5000  2/5000  10/5000

b.  -0.4 

c.  17.8785 


13.  136 


15. a.

x      C       C - 150,000

P(x)   0.9825  0.0175

b. C - 2625

c. C ≥ 2625

d. C ≥ 2875


17. a.

x      -1     1

P(x)   20/38  18/38

b. E(X) = -0.0526. In many bets the bettor sustains an average loss of
about 5.26 cents per bet.

c.  0.9986 

19.  a.  43.54 
b.  1.2046 

21.  101.02 


23. a.

x      0     1      2     3     4     5

P(x)   6/36  10/36  8/36  6/36  4/36  2/36


b.  1.9444 

c.  1.4326 


25. a.

x      0      1      2

P(x)   0.902  0.096  0.002

b.  0.902 

27.  a.  2523.25 

b.  227,092.5 

c.  270,000 

d. The owner will install the cover.

4.3 The Binomial Distribution


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  a  binomial  random  variable. 

2.  To  learn  how  to  recognize  a  random  variable  as  being  a  binomial 
random  variable. 


The  experiment  of  tossing  a  fair  coin  three  times  and  the  experiment  of  observing 
the  genders  according  to  birth  order  of  the  children  in  a  randomly  selected  three- 
child  family  are  completely  different,  but  the  random  variables  that  count  the 
number  of  heads  in  the  coin  toss  and  the  number  of  boys  in  the  family  (assuming 
the  two  genders  are  equally  likely)  are  the  same  random  variable,  the  one  with 
probability  distribution 


x      0      1      2      3

P(x)   0.125  0.375  0.375  0.125

A  histogram  that  graphically  illustrates  this  probability  distribution  is  given  in 
Figure  4.4  "Probability  Distribution  for  Three  Coins  and  Three  Children".  What  is 
common  to  the  two  experiments  is  that  we  perform  three  identical  and 
independent  trials  of  the  same  action,  each  trial  has  only  two  outcomes  (heads  or 
tails,  boy  or  girl),  and  the  probability  of  success  is  the  same  number,  0.5,  on  every 
trial.  The  random  variable  that  is  generated  is  called  the  binomial  random 
variable8  with  parameters  n  =  3  and  p  =  0.5.  This  is  just  one  case  of  a  general 
situation. 
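
The distribution in the table can be recovered by listing all eight equally likely outcomes of three tosses and counting heads. The following is a minimal sketch; the labels "H" and "T" and the variable names are illustrative, and the same code describes the three-child family if "H" is read as "boy".

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2 * 2 * 2 = 8 equally likely outcomes of three independent tosses.
outcomes = list(product("HT", repeat=3))

# Tally how many outcomes give each possible number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# P(x) = (number of outcomes with x heads) / (total number of outcomes).
distribution = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in distribution.items():
    print(x, p, float(p))
```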


8.  A  random  variable  that  counts 
successes  in  a  fixed  number  of 
independent,  identical  trials  of 
a success/failure experiment.

Figure 4.4 Probability Distribution for Three Coins and Three Children


Definition 

Suppose  a  random  experiment  has  the  following  characteristics. 

1.  There  are  n  identical  and  independent  trials  of  a  common  procedure. 

2.  There  are  exactly  two  possible  outcomes  for  each  trial,  one  termed 
“success”  and  the  other  “failure.” 

3.  The  probability  of  success  on  any  one  trial  is  the  same  number  p. 

Then  the  discrete  random  variable  X  that  counts  the  number  of  successes  in  the  n  trials 

is  the  binomial  random  variable  with  parameters  n  and  p.  We  also  say  that  X 
has  a  binomial  distribution  with  parameters  n  and  p. 


The  following  four  examples  illustrate  the  definition.  Note  how  in  every  case 
“success”  is  the  outcome  that  is  counted,  not  the  outcome  that  we  prefer  or  think  is 
better in some sense.

1. A random sample of 125 students is selected from a large college in
which  the  proportion  of  students  who  are  females  is  57%.  Suppose  X 
denotes  the  number  of  female  students  in  the  sample.  In  this  situation 
there  are  n  =  125  identical  and  independent  trials  of  a  common 
procedure,  selecting  a  student  at  random;  there  are  exactly  two 
possible  outcomes  for  each  trial,  “success”  (what  we  are  counting,  that 
the  student  be  female)  and  “failure;”  and  finally  the  probability  of 
success  on  any  one  trial  is  the  same  number  p  =  0.57.  X  is  a  binomial 
random  variable  with  parameters  n  =  125  and  p  =  0.57. 

2.  A  multiple-choice  test  has  15  questions,  each  of  which  has  five  choices. 
An  unprepared  student  taking  the  test  answers  each  of  the  questions 
completely  randomly  by  choosing  an  arbitrary  answer  from  the  five 
provided.  Suppose  X  denotes  the  number  of  answers  that  the  student 
gets right. X is a binomial random variable with parameters n = 15 and
p = 1/5 = 0.20.

3.  In  a  survey  of  1,000  registered  voters  each  voter  is  asked  if  he  intends 
to  vote  for  a  candidate  Titania  Queen  in  the  upcoming  election. 
Suppose  X  denotes  the  number  of  voters  in  the  survey  who  intend  to 
vote  for  Titania  Queen.  X  is  a  binomial  random  variable  with  n  =  1000 
and  p  equal  to  the  true  proportion  of  voters  (surveyed  or  not)  who 
intend  to  vote  for  Titania  Queen. 

4.  An  experimental  medication  was  given  to  30  patients  with  a  certain 
medical  condition.  Suppose  X  denotes  the  number  of  patients  who 
develop  severe  side  effects.  X  is  a  binomial  random  variable  with  n  =  30 
and  p  equal  to  the  true  probability  that  a  patient  with  the  underlying 
condition  will  experience  severe  side  effects  if  given  that  medication. 

Probability  Formula  for  a  Binomial  Random  Variable 

Often  the  most  difficult  aspect  of  working  a  problem  that  involves  the  binomial 
random  variable  is  recognizing  that  the  random  variable  in  question  has  a  binomial 
distribution.  Once  that  is  known,  probabilities  can  be  computed  using  the  following 
formula.

If X is a binomial random variable with parameters n and p, then

P(x) = n! / (x! (n - x)!) • pˣ qⁿ⁻ˣ

where q = 1 - p and where for any counting number m, m! (read “m
factorial”) is defined by

0! = 1, 1! = 1, 2! = 1 • 2, 3! = 1 • 2 • 3

and in general

m! = 1 • 2 • • • (m - 1) • m
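
The formula translates directly into code. The sketch below uses Python's standard math.comb for the factor n! / (x! (n - x)!); the function name binomial_pmf is an illustrative choice, not notation from the text.

```python
from math import comb, factorial


def binomial_pmf(x, n, p):
    """P(x) for a binomial random variable with parameters n and p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


# comb(n, x) is the same number as n! / (x! (n - x)!).
coefficient = factorial(5) // (factorial(2) * factorial(3))
print(coefficient, comb(5, 2))

# The case n = 3, p = 0.5 reproduces the three-coin table.
three_coins = [binomial_pmf(x, 3, 0.5) for x in range(4)]
print(three_coins)
```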
EXAMPLE 7


Seventeen  percent  of  victims  of  financial  fraud  know  the  perpetrator  of  the 

fraud  personally. 

a.  Use  the  formula  to  construct  the  probability  distribution  for  the  number 
X  of  people  in  a  random  sample  of  five  victims  of  financial  fraud  who 
knew  the  perpetrator  personally. 

b. An investigator examines five cases of financial fraud every day. Find the
most frequent number of cases each day in which the victim knew the
perpetrator.

c. An investigator examines five cases of financial fraud every day. Find the
average number of cases per day in which the victim knew the
perpetrator.

Solution: 


a.  The  random  variable  X  is  binomial  with  parameters  n  =  5  and  p  = 
0.17;  q  =  1  —  p  =  0.83.  The  possible  values  of  X  are  0, 1,  2,  3, 
4,  and  5. 


P(0) = 5! / (0! • 5!) • (0.17)⁰ (0.83)⁵

= (1 • 2 • 3 • 4 • 5) / ((1) • (1 • 2 • 3 • 4 • 5)) • 1 • (0.3939040643)

= 0.3939040643 ≈ 0.3939


P(1) = 5! / (1! • 4!) • (0.17)¹ (0.83)⁴

= (1 • 2 • 3 • 4 • 5) / ((1) • (1 • 2 • 3 • 4)) • (0.17) • (0.47458321)

= 5 • (0.17) • (0.47458321) = 0.4033957285 ≈ 0.4034


P(2) = 5! / (2! • 3!) • (0.17)² (0.83)³

= (1 • 2 • 3 • 4 • 5) / ((1 • 2) • (1 • 2 • 3)) • (0.0289) • (0.571787)

= 10 • (0.0289) • (0.571787) = 0.165246443 ≈ 0.1652


The  remaining  three  probabilities  are  computed  similarly,  to  give 
the  probability  distribution 


x      0       1       2       3       4       5

P(x)   0.3939  0.4034  0.1652  0.0338  0.0035  0.0001

The  probabilities  do  not  add  up  to  exactly  1  because  of  rounding. 

This probability distribution is represented by the histogram in
Figure 4.5 "Probability Distribution of the Binomial Random
Variable in Note 4.29 "Example 7"", which graphically illustrates just how
improbable the events X = 4 and X = 5 are. The corresponding bar in the
histogram above the number 4 is barely visible, if visible at all,
and the bar above 5 is far too short to be visible.

Figure 4.5

Probability Distribution of the Binomial Random Variable in Note 4.29 "Example 7"


b. The value of X that is most likely is X = 1, so the most frequent number
of cases seen each day in which the victim knew the perpetrator is one.

c. The average number of cases per day in which the victim knew
the perpetrator is the mean of X, which is

μ = Σ x P(x)

= 0 • 0.3939 + 1 • 0.4034 + 2 • 0.1652 + 3 • 0.0338 + 4 • 0.0035 + 5 • 0.0001
= 0.8497
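
The whole solution can be reproduced in a few lines. This is a sketch only: the variable names are illustrative, and rounding to four decimal places mirrors the table above.

```python
from math import comb

n, p = 5, 0.17
q = 1 - p

# (a) Exact probabilities P(0), ..., P(5), then rounded as in the table.
exact = [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]
table = [round(px, 4) for px in exact]

# (b) The most likely value of X is the x with the largest P(x).
mode = max(range(n + 1), key=lambda x: exact[x])

# (c) Mean computed from the rounded table, as in the text.
mean_from_table = sum(x * px for x, px in enumerate(table))

print(table)
print("most frequent number of cases:", mode)
print("mean from rounded table:", round(mean_from_table, 4))
```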


Special  Formulas  for  the  Mean  and  Standard  Deviation  of  a 
Binomial  Random  Variable 

Since  a  binomial  random  variable  is  a  discrete  random  variable,  the  formulas  for  its 
mean,  variance,  and  standard  deviation  given  in  the  previous  section  apply  to  it,  as 
we  just  saw  in  Note  4.29  "Example  7"  in  the  case  of  the  mean.  However,  for  the 
binomial random variable there are much simpler formulas.

If X is a binomial random variable with parameters n and p, then

μ = np    σ² = npq    σ = √(npq)

where q = 1 - p


EXAMPLE  8 


Find  the  mean  and  standard  deviation  of  the  random  variable  X  of  Note  4.29 
"Example  7". 

Solution: 

The  random  variable  X  is  binomial  with  parameters  n  =  5  and  p  =  0.17,  and 
q  =  1  —  p  =  0.83.  Thus  its  mean  and  standard  deviation  are 

μ = np = 5 • 0.17 = 0.85 (exactly)


and

σ = √(npq) = √(5 • 0.17 • 0.83) = √0.7055 ≈ 0.8399
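
As a check, the special formulas can be compared with the general definitions from the previous section applied to the unrounded binomial probabilities. The sketch below uses illustrative variable names.

```python
from math import comb, sqrt

n, p = 5, 0.17
q = 1 - p

# Special formulas for a binomial random variable.
mu_special = n * p
sigma_special = sqrt(n * p * q)

# General definitions applied to the exact probabilities P(0), ..., P(5).
pmf = [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]
mu_general = sum(x * px for x, px in enumerate(pmf))
variance_general = sum((x - mu_general) ** 2 * px for x, px in enumerate(pmf))
sigma_general = sqrt(variance_general)

print(round(mu_special, 4), round(mu_general, 4))
print(round(sigma_special, 4), round(sigma_general, 4))
```

The value 0.8497 obtained in Example 7 differs slightly from 0.85 only because the table entries were rounded before the mean was computed.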


The  Cumulative  Probability  Distribution  of  a  Binomial  Random 
Variable 

In  order  to  allow  a  broader  range  of  more  realistic  problems  Chapter  12  "Appendix" 
contains  probability  tables  for  binomial  random  variables  for  various  choices  of  the 
parameters  n  and  p.  These  tables  are  not  the  probability  distributions  that  we  have 
seen  so  far,  but  are  cumulative  probability  distributions.  In  the  place  of  the 
probability  P  (x)  the  table  contains  the  probability 

P(X ≤ x) = P(0) + P(1) + • • • + P(x)

This  is  illustrated  in  Figure  4.6  "Cumulative  Probabilities".  The  probability  entered 
in the table corresponds to the area of the shaded region. The reason for providing
a cumulative table is that in practical problems that involve a binomial random
variable typically the probability that is sought is of the form P(X ≤ x) or
P(X ≥ x). The cumulative table is much easier to use for computing P(X ≤ x)
since all the individual probabilities have already been computed and added. The
one table suffices for both P(X ≤ x) and P(X ≥ x) and can be used to readily
obtain probabilities of the form P(x), too, because of the following formulas. The
first is just the Probability Rule for Complements.


Figure  4.6  Cumulative  Probabilities 


If X is a discrete random variable, then

P(X ≥ x) = 1 - P(X ≤ x - 1) and P(x) = P(X ≤ x) - P(X ≤ x - 1)

EXAMPLE 9


A  student  takes  a  ten-question  true/false  exam. 

a.  Find  the  probability  that  the  student  gets  exactly  six  of  the  questions 
right  simply  by  guessing  the  answer  on  every  question. 

b.  Find  the  probability  that  the  student  will  obtain  a  passing  grade  of  60% 
or  greater  simply  by  guessing. 

Solution: 

Let  X  denote  the  number  of  questions  that  the  student  guesses  correctly. 

Then  X  is  a  binomial  random  variable  with  parameters  n  =  10  and  p  =  0.50. 

a. The probability sought is P(6). The formula gives

P(6) = 10! / (6! • 4!) • (0.5)⁶ (0.5)⁴ = 0.205078125

Using the table,

P(6) = P(X ≤ 6) - P(X ≤ 5) = 0.8281 - 0.6230 = 0.2051

b. The student must guess correctly on at least 60% of the
questions, which is 0.60 • 10 = 6 questions. The probability
sought is not P(6) (an easy mistake to make), but

P(X ≥ 6) = P(6) + P(7) + P(8) + P(9) + P(10)

Instead of computing each of these five numbers using the
formula and adding them we can use the table to obtain

P(X ≥ 6) = 1 - P(X ≤ 5) = 1 - 0.6230 = 0.3770

which  is  much  less  work  and  of  sufficient  accuracy  for  the 
situation  at  hand. 
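
The table lookups in Example 9 can also be checked numerically. The following is a minimal Python sketch, not part of the original text; the helper names binomial_pmf and binomial_cdf are illustrative assumptions. It computes P(x) from the binomial formula, accumulates P(X ≤ x) by adding the individual probabilities, and then applies the two formulas for P(x) and P(X ≥ x).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(x): probability of exactly x successes in n trials."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_cdf(x, n, p):
    """P(X <= x): the sum P(0) + P(1) + ... + P(x), as in a cumulative table."""
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))

n, p = 10, 0.5
# P(6) = P(X <= 6) - P(X <= 5)
exact_six = binomial_cdf(6, n, p) - binomial_cdf(5, n, p)
# P(X >= 6) = 1 - P(X <= 5)
at_least_six = 1 - binomial_cdf(5, n, p)
print(round(exact_six, 4), round(at_least_six, 4))
```

Because the table entries are rounded to four decimal places, values computed this way may differ from table-based answers in the last digit.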
EXAMPLE 10


An  appliance  repairman  services  five  washing  machines  on  site  each  day. 

One-third  of  the  service  calls  require  installation  of  a  particular  part. 

a.  The  repairman  has  only  one  such  part  on  his  truck  today.  Find  the 
probability  that  the  one  part  will  be  enough  today,  that  is,  that  at  most 
one  washing  machine  he  services  will  require  installation  of  this 
particular  part. 

b.  Find  the  minimum  number  of  such  parts  he  should  take  with  him  each 
day  in  order  that  the  probability  that  he  have  enough  for  the  day's 
service  calls  is  at  least  95%. 

Solution: 

Let  X  denote  the  number  of  service  calls  today  on  which  the  part  is  required. 

Then X is a binomial random variable with parameters n = 5 and
p = 1/3 = 0.333....

a. Note that the probability in question is not P(1), but rather
P(X ≤ 1). Using the cumulative distribution table in Chapter 12
"Appendix",

P(X ≤ 1) = 0.4609

b. The answer is the smallest number x such that the table entry
P(X ≤ x) is at least 0.9500. Since P(X ≤ 2) = 0.7901 is less
than 0.95, two parts are not enough. Since P(X ≤ 3) = 0.9547 is at
least 0.95, three parts will suffice at least 95% of the time. Thus the
minimum needed is three.
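
The search in part (b) of Example 10 amounts to scanning the cumulative probabilities until one reaches the target. The sketch below is one way to do this in Python; it is an illustration only, and the names binomial_cdf and min_stock are assumptions introduced here, not part of the text.

```python
from math import comb

def binomial_cdf(x, n, p):
    """P(X <= x) for a binomial random variable with parameters n and p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def min_stock(n, p, target):
    """Smallest x such that P(X <= x) is at least target."""
    for x in range(n + 1):
        if binomial_cdf(x, n, p) >= target:
            return x
    return n  # fallback: stocking n parts always covers all n calls

parts_needed = min_stock(5, 1/3, 0.95)
print(parts_needed)
```

With n = 5, p = 1/3, and a target of 0.95 the scan stops at x = 3, in agreement with the table-based answer.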
KEY TAKEAWAYS


•  The  discrete  random  variable  X  that  counts  the  number  of  successes  in  n 
identical,  independent  trials  of  a  procedure  that  always  results  in  either 
of  two  outcomes,  “success”  or  “failure,”  and  in  which  the  probability  of 
success  on  each  trial  is  the  same  number  p,  is  called  the  binomial 
random  variable  with  parameters  n  and  p. 

•  There  is  a  formula  for  the  probability  that  the  binomial  random  variable 
with  parameters  n  and  p  will  take  a  particular  value  x. 

•  There  are  special  formulas  for  the  mean,  variance,  and  standard 
deviation  of  the  binomial  random  variable  with  parameters  n  and  p  that 
are  much  simpler  than  the  general  formulas  that  apply  to  all  discrete 
random  variables. 

•  Cumulative  probability  distribution  tables,  when  available,  facilitate 
computation  of  probabilities  encountered  in  typical  practical  situations. 
1. Determine whether or not the random variable X is a binomial random
variable. If so, give the values of n and p. If not, explain why not.

a. X is the number of dots on the top face of a fair die that is rolled.

b.  X  is  the  number  of  hearts  in  a  five-card  hand  drawn  (without  replacement) 
from  a  well-shuffled  ordinary  deck. 

c.  X  is  the  number  of  defective  parts  in  a  sample  of  ten  randomly  selected 
parts  coming  from  a  manufacturing  process  in  which  0.02%  of  all  parts  are 
defective. 

d.  X  is  the  number  of  times  the  number  of  dots  on  the  top  face  of  a  fair  die  is 
even  in  six  rolls  of  the  die. 

e.  X  is  the  number  of  dice  that  show  an  even  number  of  dots  on  the  top  face 
when  six  dice  are  rolled  at  once. 

2. Determine whether or not the random variable X is a binomial random
variable. If so, give the values of n and p. If not, explain why not.

a.  X  is  the  number  of  black  marbles  in  a  sample  of  5  marbles  drawn  randomly 
and  without  replacement  from  a  box  that  contains  25  white  marbles  and  15 
black  marbles. 

b.  X  is  the  number  of  black  marbles  in  a  sample  of  5  marbles  drawn  randomly 
and  with  replacement  from  a  box  that  contains  25  white  marbles  and  15 
black  marbles. 

c. X is the number of voters in favor of a proposed law in a sample of 1,200
randomly  selected  voters  drawn  from  the  entire  electorate  of  a  country  in 
which  35%  of  the  voters  favor  the  law. 

d.  X  is  the  number  of  fish  of  a  particular  species,  among  the  next  ten  landed 
by  a  commercial  fishing  boat,  that  are  more  than  13  inches  in  length,  when 
17%  of  all  such  fish  exceed  13  inches  in  length. 

e.  X  is  the  number  of  coins  that  match  at  least  one  other  coin  when  four  coins 
are  tossed  at  once. 

3.  X  is  a  binomial  random  variable  with  parameters  n  =  12  and  p  =  0.82.  Compute 
the  probability  indicated. 

a. P(11)

b. P(9)

c. P(0)

d. P(13)

4. X is a binomial random variable with parameters n = 16 and p = 0.74. Compute
the  probability  indicated. 

a.  P  (14) 

b.  P  (4) 

c.  P( 0) 

d.  P  (20) 

5.  X  is  a  binomial  random  variable  with  parameters  n  =  5,  p  =  0.5.  Use  the  tables  in 
Chapter  12  "Appendix"  to  compute  the  probability  indicated. 

a. P(X ≤ 3)

b. P(X ≥ 3)

c. P(3)

d. P(0)

e. P(5)

6. X is a binomial random variable with parameters n = 5, p = 1/3. Use the
table in Chapter 12 "Appendix" to compute the probability indicated.

a. P(X ≤ 2)

b. P(X ≥ 2)

c. P(2)

d. P(0)

e. P(5)

7.  X  is  a  binomial  random  variable  with  the  parameters  shown.  Use  the  tables  in 
Chapter  12  "Appendix"  to  compute  the  probability  indicated. 

a. n = 10, p = 0.25, P(X ≤ 6)

b. n = 10, p = 0.75, P(X ≤ 6)

c. n = 15, p = 0.75, P(X ≤ 6)

d. n = 15, p = 0.75, P(12)

e. n = 15, p = 2/3, P(10 ≤ X ≤ 12)

8.  X  is  a  binomial  random  variable  with  the  parameters  shown.  Use  the  tables  in 
Chapter  12  "Appendix"  to  compute  the  probability  indicated. 

a. n = 5, p = 0.05, P(X ≤ 1)

b. n = 5, p = 0.5, P(X ≤ 1)

c. n = 10, p = 0.75, P(X ≤ 5)

d. n = 10, p = 0.75, P(12)

e. n = 10, p = 2/3, P(5 ≤ X ≤ 8)

9. X is a binomial random variable with the parameters shown. Use the special
formulas to compute its mean μ and standard deviation σ.

a.  n  =  8,  p  =  0.43 

b.  n  =  47,  p  =  0.82 

c.  n  =  1200,  p  =  0.44 

d.  n  =  2100,  p  =  0.62 

10. X is a binomial random variable with the parameters shown. Use the special
formulas to compute its mean μ and standard deviation σ.


a.  n  =  14,  p  =  0.55 

b.  n  =  83,  p  =  0.05 

c.  n  =  957,  p  =  0.35 

d.  n=  1750,  p  =  0.79 


11. X is a binomial random variable with the parameters shown. Compute its mean
μ and standard deviation σ in two ways, first using the tables in Chapter 12
"Appendix" in conjunction with the general formulas

\[ \mu = \sum x\,P(x) \quad \text{and} \quad \sigma = \sqrt{\Bigl[\sum x^2 P(x)\Bigr] - \mu^2}, \]

then using the special formulas

\[ \mu = np \quad \text{and} \quad \sigma = \sqrt{npq}. \]

a. n = 5, p = 1/3

b.  n  =  10,  p  =  0.75 


12. X is a binomial random variable with the parameters shown. Compute its mean
μ and standard deviation σ in two ways, first using the tables in Chapter 12
"Appendix" in conjunction with the general formulas

\[ \mu = \sum x\,P(x) \quad \text{and} \quad \sigma = \sqrt{\Bigl[\sum x^2 P(x)\Bigr] - \mu^2}, \]

then using the special formulas

\[ \mu = np \quad \text{and} \quad \sigma = \sqrt{npq}. \]


a.  n  =  10,  p  =  0.25 

b.  n  =  15,  p  =  0.1 

13.  X  is  a  binomial  random  variable  with  parameters  n  =  10  and p  =  1/3.  Use 
the  cumulative  probability  distribution  for  X  that  is  given  in  Chapter  12 
"Appendix"  to  construct  the  probability  distribution  of  X. 

14.  X  is  a  binomial  random  variable  with  parameters  n  =  15  and p  =  1  /  2.  Use 
the  cumulative  probability  distribution  for  X  that  is  given  in  Chapter  12 
"Appendix"  to  construct  the  probability  distribution  of  X. 
15. In a certain board game a player's turn begins with three rolls of a pair of dice.
If the player rolls doubles all three times there is a penalty. The probability of
rolling doubles in a single roll of a pair of fair dice is 1/6. Find the probability
of rolling doubles all three times.

16.  A  coin  is  bent  so  that  the  probability  that  it  lands  heads  up  is  2/3.  The  coin  is 
tossed  ten  times. 

a.  Find  the  probability  that  it  lands  heads  up  at  most  five  times. 

b.  Find  the  probability  that  it  lands  heads  up  more  times  than  it  lands  tails 
up. 


APPLICATIONS 


17.  An  English-speaking  tourist  visits  a  country  in  which  30%  of  the  population 
speaks  English.  He  needs  to  ask  someone  directions. 

a.  Find  the  probability  that  the  first  person  he  encounters  will  be  able  to 
speak  English. 

b.  The  tourist  sees  four  local  people  standing  at  a  bus  stop.  Find  the 
probability  that  at  least  one  of  them  will  be  able  to  speak  English. 

18.  The  probability  that  an  egg  in  a  retail  package  is  cracked  or  broken  is  0.025. 

a.  Find  the  probability  that  a  carton  of  one  dozen  eggs  contains  no  eggs  that 
are  either  cracked  or  broken. 

b.  Find  the  probability  that  a  carton  of  one  dozen  eggs  has  (i)  at  least  one  that 
is  either  cracked  or  broken;  (ii)  at  least  two  that  are  cracked  or  broken. 

c.  Find  the  average  number  of  cracked  or  broken  eggs  in  one  dozen  cartons. 

19.  An  appliance  store  sells  20  refrigerators  each  week.  Ten  percent  of  all 
purchasers of a refrigerator buy an extended warranty. Let X denote the
number  of  the  next  20  purchasers  who  do  so. 

a.  Verify  that  X  satisfies  the  conditions  for  a  binomial  random  variable,  and 
find  n  and  p. 

b.  Find  the  probability  that  X  is  zero. 

c.  Find  the  probability  that  X  is  two,  three,  or  four. 

d.  Find  the  probability  that  X  is  at  least  five. 

20.  Adverse  growing  conditions  have  caused  5%  of  grapefruit  grown  in  a  certain 
region  to  be  of  inferior  quality.  Grapefruit  are  sold  by  the  dozen. 

a.  Find  the  average  number  of  inferior  quality  grapefruit  per  box  of  a  dozen. 
b. A box that contains two or more grapefruit of inferior quality will cause a
strong  adverse  customer  reaction.  Find  the  probability  that  a  box  of  one 
dozen  grapefruit  will  contain  two  or  more  grapefruit  of  inferior  quality. 

21.  The  probability  that  a  7-ounce  skein  of  a  discount  worsted  weight  knitting 
yarn  contains  a  knot  is  0.25.  Goneril  buys  ten  skeins  to  crochet  an  afghan. 

a.  Find  the  probability  that  (i)  none  of  the  ten  skeins  will  contain  a  knot;  (ii) 
at  most  one  will. 

b.  Find  the  expected  number  of  skeins  that  contain  knots. 

c.  Find  the  most  likely  number  of  skeins  that  contain  knots. 

22.  One-third  of  all  patients  who  undergo  a  non-invasive  but  unpleasant  medical 
test  require  a  sedative.  A  laboratory  performs  20  such  tests  daily.  Let  X  denote 
the  number  of  patients  on  any  given  day  who  require  a  sedative. 

a.  Verify  that  X  satisfies  the  conditions  for  a  binomial  random  variable,  and 
find  n  and  p. 

b.  Find  the  probability  that  on  any  given  day  between  five  and  nine  patients 
will  require  a  sedative  (include  five  and  nine). 

c.  Find  the  average  number  of  patients  each  day  who  require  a  sedative. 

d. Using the cumulative probability distribution for X in Chapter 12
"Appendix", find the minimum number x_min of doses of the sedative that
should  be  on  hand  at  the  start  of  the  day  so  that  there  is  a  99%  chance  that 
the  laboratory  will  not  run  out. 

23.  About  2%  of  alumni  give  money  upon  receiving  a  solicitation  from  the  college 
or university from which they graduated. Find the average number of monetary
gifts  a  college  can  expect  from  every  2,000  solicitations  it  sends. 

24.  Of  all  college  students  who  are  eligible  to  give  blood,  about  18%  do  so  on  a 
regular  basis.  Each  month  a  local  blood  bank  sends  an  appeal  to  give  blood  to 
250  randomly  selected  students.  Find  the  average  number  of  appeals  in  such 
mailings  that  are  made  to  students  who  already  give  blood. 

25.  About  12%  of  all  individuals  write  with  their  left  hands.  A  class  of  130  students 
meets  in  a  classroom  with  130  individual  desks,  exactly  14  of  which  are 
constructed  for  people  who  write  with  their  left  hands.  Find  the  probability 
that  exactly  14  of  the  students  enrolled  in  the  class  write  with  their  left  hands. 

26.  A  travelling  salesman  makes  a  sale  on  65%  of  his  calls  on  regular  customers.  He 
makes  four  sales  calls  each  day. 

a.  Construct  the  probability  distribution  of  X,  the  number  of  sales  made  each 
day. 
b. Find the probability that, on a randomly selected day, the salesman will
make  a  sale. 

c.  Assuming  that  the  salesman  makes  20  sales  calls  per  week,  find  the  mean 
and  standard  deviation  of  the  number  of  sales  made  per  week. 

27. A corporation has advertised heavily to try to ensure that over half the adult
population  recognizes  the  brand  name  of  its  products.  In  a  random  sample  of 
20  adults,  14  recognized  its  brand  name.  What  is  the  probability  that  14  or 
more  people  in  such  a  sample  would  recognize  its  brand  name  if  the  actual 
proportion  p  of  all  adults  who  recognize  the  brand  name  were  only  0.50? 


ADDITIONAL  EXERCISES 


28.  When  dropped  on  a  hard  surface  a  thumbtack  lands  with  its  sharp  point 
touching  the  surface  with  probability  2/3;  it  lands  with  its  sharp  point  directed 
up into the air with probability 1/3. The tack is dropped and its landing
position  observed  15  times. 

a.  Find  the  probability  that  it  lands  with  its  point  in  the  air  at  least  7  times. 

b.  If  the  experiment  of  dropping  the  tack  15  times  is  done  repeatedly,  what  is 
the  average  number  of  times  it  lands  with  its  point  in  the  air? 

29.  A  professional  proofreader  has  a  98%  chance  of  detecting  an  error  in  a  piece  of 
written  work  (other  than  misspellings,  double  words,  and  similar  errors  that 
are  machine  detected).  A  work  contains  four  errors. 

a.  Find  the  probability  that  the  proofreader  will  miss  at  least  one  of  them. 

b.  Show  that  two  such  proofreaders  working  independently  have  a  99.96% 
chance  of  detecting  an  error  in  a  piece  of  written  work. 

c.  Find  the  probability  that  two  such  proofreaders  working  independently 
will  miss  at  least  one  error  in  a  work  that  contains  four  errors. 

30.  A  multiple  choice  exam  has  20  questions;  there  are  four  choices  for  each 
question. 

a.  A  student  guesses  the  answer  to  every  question.  Find  the  chance  that  he 
guesses  correctly  between  four  and  seven  times. 

b.  Find  the  minimum  score  the  instructor  can  set  so  that  the  probability  that 
a  student  will  pass  just  by  guessing  is  20%  or  less. 

31.  In  spite  of  the  requirement  that  all  dogs  boarded  in  a  kennel  be  inoculated,  the 
chance  that  a  healthy  dog  boarded  in  a  clean,  well-ventilated  kennel  will 
develop  kennel  cough  from  a  carrier  is  0.008. 



a.  If  a  carrier  (not  known  to  be  such,  of  course)  is  boarded  with  three  other 
dogs,  what  is  the  probability  that  at  least  one  of  the  three  healthy  dogs  will 
develop  kennel  cough? 

b.  If  a  carrier  is  boarded  with  four  other  dogs,  what  is  the  probability  that  at 
least  one  of  the  four  healthy  dogs  will  develop  kennel  cough? 

c.  The  pattern  evident  from  parts  (a)  and  (b)  is  that  if  K  +  1  dogs  are 
boarded  together,  one  a  carrier  and  K  healthy  dogs,  then  the  probability 
that  at  least  one  of  the  healthy  dogs  will  develop  kennel  cough  is 

P(X ≥ 1) = 1 − (0.992)^K, where X is the binomial random variable
that  counts  the  number  of  healthy  dogs  that  develop  the  condition. 
Experiment  with  different  values  of  K  in  this  formula  to  find  the  maximum 
number  K  +  1  of  dogs  that  a  kennel  owner  can  board  together  so  that  if 
one  of  the  dogs  has  the  condition,  the  chance  that  another  dog  will  be 
infected  is  less  than  0.05. 

32.  Investigators  need  to  determine  which  of  600  adults  have  a  medical  condition 
that  affects  2%  of  the  adult  population.  A  blood  sample  is  taken  from  each  of 
the  individuals. 

a.  Show  that  the  expected  number  of  diseased  individuals  in  the  group  of  600 
is  12  individuals. 

b.  Instead  of  testing  all  600  blood  samples  to  find  the  expected  12  diseased 
individuals,  investigators  group  the  samples  into  60  groups  of  10  each,  mix 
a  little  of  the  blood  from  each  of  the  10  samples  in  each  group,  and  test 
each  of  the  60  mixtures.  Show  that  the  probability  that  any  such  mixture 
will  contain  the  blood  of  at  least  one  diseased  person,  hence  test  positive, 
is  about  0.18. 

c.  Based  on  the  result  in  (b),  show  that  the  expected  number  of  mixtures  that 
test  positive  is  about  11.  (Supposing  that  indeed  11  of  the  60  mixtures  test 
positive,  then  we  know  that  none  of  the  490  persons  whose  blood  was  in 
the  remaining  49  samples  that  tested  negative  has  the  disease.  We  have 
eliminated  490  persons  from  our  search  while  performing  only  60  tests.) 



ANSWERS 


1. a. not binomial; not success/failure.
   b. not binomial; trials are not independent.
   c. binomial; n = 10, p = 0.0002
   d. binomial; n = 6, p = 0.5
   e. binomial; n = 6, p = 0.5

3. a. 0.2434
   b. 0.2151
   c. 0.18^12 ≈ 0
   d. 0

5. a. 0.8125
   b. 0.5000
   c. 0.3125
   d. 0.0313
   e. 0.0312

7. a. 0.9965
   b. 0.2241
   c. 0.0042
   d. 0.2252
   e. 0.5390

9. a. μ = 3.44, σ = 1.4003
   b. μ = 38.54, σ = 2.6339
   c. μ = 528, σ = 17.1953
   d. μ = 1302, σ = 22.2432

11. a. μ = 1.6667, σ = 1.0541
    b. μ = 7.5, σ = 1.3693

13.
    x      0       1       2       3
    P(x)   0.0173  0.0867  0.1951  0.2602

    x      4       5       6       7
    P(x)   0.2276  0.1365  0.0569  0.0163

    x      8       9       10
    P(x)   0.0030  0.0004  0.0000

15. 0.0046

17. a. 0.3
    b. 0.7599

19. a. n = 20, p = 0.1
    b. 0.1216
    c. 0.5651
    d. 0.0432

21. a. 0.0563 and 0.2440
    b. 2.5
    c. 2

23. 40

25. 0.1019

27. 0.0577

29. a. 0.0776
    b. 0.9996
    c. 0.0016

31. a. 0.0238
    b. 0.0316
    c. 6
Continuous Random Variables


As discussed in Section 4.1 "Random Variables" in Chapter 4 "Discrete Random
Variables", a random variable is called continuous if its set of possible values
contains a whole interval of decimal numbers. In this chapter we investigate such
random variables.

5.1 Continuous Random Variables


LEARNING  OBJECTIVES 

1.  To  learn  the  concept  of  the  probability  distribution  of  a  continuous 
random  variable,  and  how  it  is  used  to  compute  probabilities. 

2.  To  learn  basic  facts  about  the  family  of  normally  distributed  random 
variables. 


The  Probability  Distribution  of  a  Continuous  Random  Variable 

For  a  discrete  random  variable  X  the  probability  that  X  assumes  one  of  its  possible 
values  on  a  single  trial  of  the  experiment  makes  good  sense.  This  is  not  the  case  for 
a  continuous  random  variable.  For  example,  suppose  X  denotes  the  length  of  time  a 
commuter just arriving at a bus stop has to wait for the next bus. If buses run every
30 minutes without fail, then the set of possible values of X is the interval denoted
[0,30], the set of all decimal numbers between 0 and 30. But although the number
7.211916 is a possible value of X, there is little or no meaning to the concept of the
probability that the commuter will wait precisely 7.211916 minutes for the next bus.
If anything the probability should be zero, since if we could meaningfully measure
the  waiting  time  to  the  nearest  millionth  of  a  minute  it  is  practically  inconceivable 
that  we  would  ever  get  exactly  7.211916  minutes.  More  meaningful  questions  are 
those  of  the  form:  What  is  the  probability  that  the  commuter's  waiting  time  is  less 
than  10  minutes,  or  is  between  5  and  10  minutes?  In  other  words,  with  continuous 
random  variables  one  is  concerned  not  with  the  event  that  the  variable  assumes  a 
single  particular  value,  but  with  the  event  that  the  random  variable  assumes  a  value 
in  a  particular  interval. 
Definition

The probability distribution of a continuous random variable X is an
assignment of probabilities to intervals of decimal numbers using a function f(x),
called a density function [1], in the following way: the probability that X assumes a
value in the interval [a, b] is equal to the area of the region that is bounded above by
the graph of the equation y = f(x), bounded below by the x-axis, and bounded on the
left and right by the vertical lines through a and b, as illustrated in Figure 5.1
"Probability Given as Area of a Region under a Curve".

Figure  5.1 

Probability  Given  as  Area  of  a  Region  under  a  Curve 

P(a  <  X  <  b)  =  area  of  shaded  region 


1.  The  function  f  (x)  such  that 
probabilities  of  a  continuous 
random  variable  X  are  areas  of 
regions  under  the  graph  of 

y=f(x). 


This  definition  can  be  understood  as  a  natural  outgrowth  of  the  discussion  in 
Section  2.1.3  "Relative  Frequency  Histograms"  in  Chapter  2  "Descriptive  Statistics". 
There  we  saw  that  if  we  have  in  view  a  population  (or  a  very  large  sample)  and 
make  measurements  with  greater  and  greater  precision,  then  as  the  bars  in  the 
relative  frequency  histogram  become  exceedingly  fine  their  vertical  sides  merge 
and  disappear,  and  what  is  left  is  just  the  curve  formed  by  their  tops,  as  shown  in 
Figure  2.5  "Sample  Size  and  Relative  Frequency  Histograms"  in  Chapter  2 
"Descriptive  Statistics".  Moreover  the  total  area  under  the  curve  is  1,  and  the 
proportion  of  the  population  with  measurements  between  two  numbers  a  and  b  is 
the  area  under  the  curve  and  between  a  and  b,  as  shown  in  Figure  2.6  "A  Very  Fine 
Relative Frequency Histogram" in Chapter 2 "Descriptive Statistics". If we think of X
as a measurement to infinite precision arising from the selection of any one
member of the population at random, then P(a < X < b) is simply the
proportion of the population with measurements between a and b, the curve in the
relative frequency histogram is the density function for X, and we arrive at the
definition just above.


Every density function f(x) must satisfy the following two conditions:


1. For all numbers x, f(x) ≥ 0, so that the graph of y = f(x) never
drops below the x-axis.

2. The area of the region under the graph of y = f(x) and above the
x-axis is 1.


Because  the  area  of  a  line  segment  is  0,  the  definition  of  the  probability  distribution 
of  a  continuous  random  variable  implies  that  for  any  particular  decimal  number, 
say  a,  the  probability  that  X  assumes  the  exact  value  a  is  0.  This  property  implies 
that  whether  or  not  the  endpoints  of  an  interval  are  included  makes  no  difference 
concerning  the  probability  of  the  interval. 


For  any  continuous  random  variable  X: 

P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b)

EXAMPLE 1


A random variable X has the uniform distribution on the interval [0,1]: the
density function is f(x) = 1 if x is between 0 and 1 and f(x) = 0 for all
other values of x, as shown in Figure 5.2 "Uniform Distribution on [0,1]".

Figure  5.2 


a.  Find  P(X  >  0.75),  the  probability  that  X  assumes  a  value  greater  than  0.75. 

b. Find P(X ≤ 0.2), the probability that X assumes a value less than or equal
to  0.2. 

c.  Find  P(0.4  <  X  <  0.7),  the  probability  that  X  assumes  a  value  between  0.4 
and  0.7. 

Solution: 

a. P(X > 0.75) is the area of the rectangle of height 1 and base length
1 − 0.75 = 0.25, hence is base × height = (0.25) · (1) = 0.25. See Figure 5.3
"Probabilities from the Uniform Distribution on [0,1]"(a).

b. P(X ≤ 0.2) is the area of the rectangle of height 1 and base length
0.2 − 0 = 0.2, hence is base × height = (0.2) · (1) = 0.2.
See Figure 5.3 "Probabilities from the Uniform Distribution on [0,1]"(b).

c. P(0.4 < X < 0.7) is the area of the rectangle of height 1 and base length
0.7 − 0.4 = 0.3, hence is base × height = (0.3) · (1) = 0.3. See Figure 5.3
"Probabilities from the Uniform Distribution on [0,1]"(c).

Figure 5.3

Probabilities from the Uniform Distribution on [0,1]

[Figure: three panels showing the shaded rectangles for (a) x > 0.75, (b) x ≤ 0.2, and (c) 0.4 < x < 0.7.]


EXAMPLE  2 


A  man  arrives  at  a  bus  stop  at  a  random  time  (that  is,  with  no  regard  for  the 
scheduled  service)  to  catch  the  next  bus.  Buses  run  every  30  minutes 
without  fail,  hence  the  next  bus  will  come  any  time  during  the  next  30 
minutes  with  evenly  distributed  probability  (a  uniform  distribution).  Find 
the  probability  that  a  bus  will  come  within  the  next  10  minutes. 

Solution: 

The graph of the density function is a horizontal line above the interval
from 0 to 30 and is the x-axis everywhere else. Since the total area under the
curve must be 1, the height of the horizontal line is 1/30. See Figure 5.4
"Probability of Waiting At Most 10 Minutes for a Bus". The probability
sought is P(0 ≤ X ≤ 10). By definition, this probability is the area of
the rectangular region bounded above by the horizontal line
f(x) = 1/30, bounded below by the x-axis, bounded on the left by the
vertical line at 0 (the y-axis), and bounded on the right by the vertical line at
10. This is the shaded region in Figure 5.4 "Probability of Waiting At Most 10
Minutes for a Bus". Its area is the base of the rectangle times its height,

10 · (1/30) = 1/3. Thus P(0 ≤ X ≤ 10) = 1/3.

Figure  5.4 

Probability  of  Waiting  At  Most  10  Minutes  for  a  Bus 
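
For a uniform distribution every probability is the area of a rectangle, so the computations in Examples 1 and 2 reduce to base times height. The following Python sketch illustrates this under that assumption; the function name uniform_prob and its parameters low and high are hypothetical names chosen here for illustration.

```python
def uniform_prob(a, b, low=0.0, high=1.0):
    """P(a <= X <= b) for X uniform on [low, high], computed as base times height."""
    if high <= low:
        raise ValueError("need low < high")
    left = max(a, low)              # clip to the interval where the density is nonzero
    right = min(b, high)
    base = max(right - left, 0.0)   # an empty or reversed interval has zero area
    height = 1.0 / (high - low)     # chosen so that the total area is 1
    return base * height

# Example 1: uniform distribution on [0,1]
example_1 = [uniform_prob(0.75, 1.0), uniform_prob(0.0, 0.2), uniform_prob(0.4, 0.7)]
# Example 2: waiting time uniform on [0,30], bus within the next 10 minutes
example_2 = uniform_prob(0, 10, low=0, high=30)
# A single point has base length 0, hence probability 0
single_point = uniform_prob(0.5, 0.5)
print(example_1, example_2, single_point)
```

The last line of the sketch reflects the fact that the probability of any single value is 0, which is why including or excluding endpoints does not change the result.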
Normal Distributions

Most people have heard of the “bell curve.” It is the graph of a specific density
function f(x) that describes the behavior of continuous random variables as
different as the heights of human beings, the amount of a product in a container
that was filled by a high-speed packing machine, or the velocities of molecules in a
gas. The formula for f(x) contains two parameters μ and σ that can be assigned any
specific numerical values, so long as σ is positive. We will not need to know the
formula for f(x), but for those who are interested it is

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}(\mu - x)^2/\sigma^2} \]

where π ≈ 3.14159 and e ≈ 2.71828 is the base of the natural logarithms.
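
For readers who want to experiment with the formula, the sketch below evaluates the bell-curve density in Python and checks numerically that the area under it is very close to 1. It is an illustration only: the function names, the midpoint-rule approximation, and the choice to integrate over ten standard deviations on either side of the mean are assumptions made here, not part of the text.

```python
from math import exp, pi, sqrt

def normal_density(x, mu, sigma):
    """Bell-curve density with parameters mu and sigma (sigma must be positive)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return exp(-0.5 * (mu - x) ** 2 / sigma ** 2) / sqrt(2 * pi * sigma ** 2)

def area_under_curve(f, a, b, steps=20000):
    """Midpoint-rule approximation of the area under f between a and b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

mu, sigma = 6.0, 0.25
total_area = area_under_curve(
    lambda x: normal_density(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma
)
print(total_area)
```

Because (μ − x) enters the formula only through its square, the density takes the same value at equal distances on either side of μ.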


Each different choice of specific numerical values for the pair μ and σ gives a
different bell curve. The value of μ determines the location of the curve, as shown
in Figure 5.5 "Bell Curves with σ = 0.25 and Different Values of μ". In each case
the curve is symmetric about μ.


Figure 5.5 Bell Curves with σ = 0.25 and Different Values of μ

[Figure: curves labeled μ = −2, μ = −1, μ = 1]


The value of σ determines whether the bell curve is tall and thin or short and squat,
subject always to the condition that the total area under the curve be equal to 1.
This is shown in Figure 5.6 "Bell Curves with μ = 6 and Different Values of σ",
where we have arbitrarily chosen to center the curves at μ = 6.

Figure 5.6 Bell Curves with μ = 6 and Different Values of σ


2.  Assignment  of  probabilities  to 
a  continuous  random  variable 
using  a  bell  curve  for  the 
density  function. 

3.  A  continuous  random  variable 
whose  probabilities  are 
determined  by  a  bell  curve. 


Figure 5.7 "Density Function for a Normally Distributed Random Variable with
Mean μ and Standard Deviation σ" shows the density function that determines the
normal distribution with mean μ and standard deviation σ. We repeat an important
fact about this curve:

The density curve for the normal distribution is symmetric about the mean.


Figure 5.7 Density Function for a Normally Distributed Random Variable with Mean μ and Standard Deviation σ

EXAMPLE 3


Heights  of  25-year-old  men  in  a  certain  region  have  mean  69.75  inches  and 
standard  deviation  2.59  inches.  These  heights  are  approximately  normally 
distributed.  Thus  the  height  X  of  a  randomly  selected  25-year-old  man  is  a 
normal random variable with mean μ = 69.75 and standard deviation σ = 2.59.
Sketch a qualitatively accurate graph of the density function for X. Find
the  probability  that  a  randomly  selected  25-year-old  man  is  more  than  69.75 
inches  tall. 

Solution: 

The  distribution  of  heights  looks  like  the  bell  curve  in  Figure  5.8  "Density 
Function  for  Heights  of  25-Year-Old  Men".  The  important  point  is  that  it  is 
centered  at  its  mean,  69.75,  and  is  symmetric  about  the  mean. 

Figure  5.8 

Density Function for Heights of 25-Year-Old Men


Since  the  total  area  under  the  curve  is  1,  by  symmetry  the  area  to  the  right 
of  69.75  is  half  the  total,  or  0.5.  But  this  area  is  precisely  the  probability  P(X  > 
69.75),  the  probability  that  a  randomly  selected  25-year-old  man  is  more 
than  69.75  inches  tall. 
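The symmetry argument can be confirmed with a short computation. This is only a sketch, assuming Python's standard library class statistics.NormalDist to model X; the variable names are illustrative.

```python
from statistics import NormalDist

# Height X of a randomly selected 25-year-old man, as in Example 3.
heights = NormalDist(mu=69.75, sigma=2.59)

# P(X > 69.75) is the area under the density curve to the right of the mean.
p_taller = 1 - heights.cdf(69.75)
print(p_taller)
```

The result agrees with the value 0.5 obtained from symmetry alone, and it does not depend on the standard deviation.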

We  will  learn  how  to  compute  other  probabilities  in  the  next  two  sections. 

KEY TAKEAWAYS


•  For  a  continuous  random  variable  X  the  only  probabilities  that  are 
computed  are  those  of  X  taking  a  value  in  a  specified  interval. 

•  The  probability  that  X  take  a  value  in  a  particular  interval  is  the  same 
whether  or  not  the  endpoints  of  the  interval  are  included. 

•  The probability P(a < X < b), that X take a value in the interval
from a to b, is the area of the region between the vertical lines through a
and b, above the x-axis, and below the graph of a function f(x) called
the density function.

•  A  normally  distributed  random  variable  is  one  whose  density  function  is 
a  bell  curve. 

•  Every  bell  curve  is  symmetric  about  its  mean  and  lies  everywhere  above 
the  x-axis,  which  it  approaches  asymptotically  (arbitrarily  closely 
without touching).


1.  A continuous random variable X has a uniform distribution on the interval
[5, 12]. Sketch the graph of its density function.

2.  A  continuous  random  variable  X  has  a  uniform  distribution  on  the  interval 
[-3, 3]. Sketch the graph of its density function.

3.  A  continuous  random  variable  X  has  a  normal  distribution  with  mean  100  and 
standard  deviation  10.  Sketch  a  qualitatively  accurate  graph  of  its  density 
function. 

4.  A  continuous  random  variable  X  has  a  normal  distribution  with  mean  73  and 
standard  deviation  2.5.  Sketch  a  qualitatively  accurate  graph  of  its  density 
function. 

5.  A  continuous  random  variable  X  has  a  normal  distribution  with  mean  73.  The 
probability  that  X  takes  a  value  greater  than  80  is  0.212.  Use  this  information 
and  the  symmetry  of  the  density  function  to  find  the  probability  that  X  takes  a 
value  less  than  66.  Sketch  the  density  curve  with  relevant  regions  shaded  to 
illustrate  the  computation. 

6.  A  continuous  random  variable  X  has  a  normal  distribution  with  mean  169.  The 
probability  that  X  takes  a  value  greater  than  180  is  0.17.  Use  this  information 
and  the  symmetry  of  the  density  function  to  find  the  probability  that  X  takes  a 
value  less  than  158.  Sketch  the  density  curve  with  relevant  regions  shaded  to 
illustrate  the  computation. 

7.  A  continuous  random  variable  X  has  a  normal  distribution  with  mean  50.5.  The 
probability  that  X  takes  a  value  less  than  54  is  0.76.  Use  this  information  and 
the  symmetry  of  the  density  function  to  find  the  probability  that  X  takes  a 
value  greater  than  47.  Sketch  the  density  curve  with  relevant  regions  shaded 
to  illustrate  the  computation. 

8.  A  continuous  random  variable  X  has  a  normal  distribution  with  mean  12.25. 
The  probability  that  X  takes  a  value  less  than  13  is  0.82.  Use  this  information 
and  the  symmetry  of  the  density  function  to  find  the  probability  that  X  takes  a 
value  greater  than  11.50.  Sketch  the  density  curve  with  relevant  regions 
shaded  to  illustrate  the  computation. 

9.  The figure provided shows the density curves of three normally distributed
random variables X_A, X_B, and X_C. Their standard deviations (in no particular
order) are 15, 7, and 20. Use the figure to identify the values of the means μ_A,
μ_B, and μ_C and standard deviations σ_A, σ_B, and σ_C of the three random
variables.


10.  The figure provided shows the density curves of three normally distributed
random variables X_A, X_B, and X_C. Their standard deviations (in no particular
order) are 20, 5, and 10. Use the figure to identify the values of the means μ_A,
μ_B, and μ_C and standard deviations σ_A, σ_B, and σ_C of the three random
variables.


11.  Dogberry's  alarm  clock  is  battery  operated.  The  battery  could  fail  with  equal 
probability  at  any  time  of  the  day  or  night.  Every  day  Dogberry  sets  his  alarm 
for  6:30  a.m.  and  goes  to  bed  at  10:00  p.m.  Find  the  probability  that  when  the 
clock  battery  finally  dies,  it  will  do  so  at  the  most  inconvenient  time,  between 
10:00  p.m.  and  6:30  a.m. 

12.  Buses  running  a  bus  line  near  Desdemona's  house  run  every  15  minutes. 
Without  paying  attention  to  the  schedule  she  walks  to  the  nearest  stop  to  take 
the  bus  to  town.  Find  the  probability  that  she  waits  more  than  10  minutes. 

13.  The  amount  X  of  orange  juice  in  a  randomly  selected  half-gallon  container 
varies  according  to  a  normal  distribution  with  mean  64  ounces  and  standard 
deviation  0.25  ounce. 

a.  Sketch  the  graph  of  the  density  function  for  X. 

b.  What  proportion  of  all  containers  contain  less  than  a  half  gallon  (64 
ounces)?  Explain. 

c.  What is the median amount of orange juice in such containers? Explain.

14.  The weight X of grass seed in bags marked 50 lb varies according to a normal
distribution  with  mean  50  lb  and  standard  deviation  1  ounce  (0.0625  lb). 

a.  Sketch  the  graph  of  the  density  function  for  X. 

b.  What  proportion  of  all  bags  weigh  less  than  50  pounds?  Explain. 

c.  What  is  the  median  weight  of  such  bags?  Explain. 


ANSWERS 


1.  The graph is a horizontal line with height 1/7 from x = 5 to x = 12.

3.  The  graph  is  a  bell-shaped  curve  centered  at  100  and  extending  from  about  70 
to  130. 

5.  0.212 

7.  0.76 

9.  μ_A = 100, μ_B = 200, μ_C = 300, σ_A = 7, σ_B = 20, σ_C = 15

11.  0.3542 

13.  a.  The  graph  is  a  bell-shaped  curve  centered  at  64  and  extending  from  about 
63.25  to  64.75. 

b.  0.5 

c.  64


5.2  The Standard Normal Distribution


LEARNING  OBJECTIVES 

1.  To  learn  what  a  standard  normal  random  variable  is. 

2.  To  learn  how  to  use  Figure  12.2  "Cumulative  Normal  Probability"  to 
compute  probabilities  related  to  a  standard  normal  random  variable. 


Definition 

A standard normal random variable is a normally distributed random variable
with mean μ = 0 and standard deviation σ = 1. It will always be denoted by the letter Z.


The  density  function  for  a  standard  normal  random  variable  is  shown  in  Figure  5.9 
"Density  Curve  for  a  Standard  Normal  Random  Variable". 


Figure  5.9  Density  Curve  for  a  Standard  Normal  Random  Variable 


4.  The  normal  random  variable 
with  mean  0  and  standard 
deviation  1. 


To compute probabilities for Z we will not work with its density function directly
but instead read probabilities out of Figure 12.2 "Cumulative Normal Probability" in
Chapter 12 "Appendix". The tables are tables of cumulative probabilities; their
entries are probabilities of the form P(Z < z). The use of the tables will be
explained by the following series of examples.


EXAMPLE  4 


Find  the  probabilities  indicated,  where  as  always  Z  denotes  a  standard 
normal  random  variable. 

a.  P(Z<  1.48). 

b.  P(Z<  -0.25). 

Solution: 


a.  Figure  5.10  "Computing  Probabilities  Using  the  Cumulative  Table"  shows 
how  this  probability  is  read  directly  from  the  table  without  any 
computation  required.  The  digits  in  the  ones  and  tenths  places  of  1.48, 
namely  1.4,  are  used  to  select  the  appropriate  row  of  the  table;  the 
hundredths  part  of  1.48,  namely  0.08,  is  used  to  select  the  appropriate 
column  of  the  table.  The  four  decimal  place  number  in  the  interior  of 
the  table  that  lies  in  the  intersection  of  the  row  and  column  selected, 
0.9306,  is  the  probability  sought:  P  (Z  <  1.48)  =  0.9306. 


Figure  5.10 

Computing  Probabilities  Using  the  Cumulative  Table 


area  of  the  shaded  region  =  0.9306 


b.  The minus sign in -0.25 makes no difference in the procedure; the table
is used in exactly the same way as in part (a): the probability sought is
the number that is in the intersection of the row with heading -0.2 and
the column with heading 0.05, the number 0.4013. Thus P(Z < -0.25) =
0.4013.
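The table lookups in this example can be cross-checked in software. The sketch below assumes Python's standard library class statistics.NormalDist, whose cdf method returns the cumulative probability P(Z < z) for a standard normal variable when the defaults mean 0 and standard deviation 1 are used. It is a check on the table, not a replacement for learning to read it.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_a = Z.cdf(1.48)   # P(Z < 1.48)
p_b = Z.cdf(-0.25)  # P(Z < -0.25)

# Round to four decimal places to match the table entries.
print(round(p_a, 4), round(p_b, 4))  # 0.9306 0.4013
```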

EXAMPLE 5


Find  the  probabilities  indicated. 

a.  P(Z>  1.60). 

b.  P(Z>  -1.02). 

Solution: 


a.  Because the events Z > 1.60 and Z ≤ 1.60 are complements, the
Probability Rule for Complements implies that

P(Z > 1.60) = 1 - P(Z ≤ 1.60)

Since inclusion of the endpoint makes no difference for the
continuous random variable Z, P(Z ≤ 1.60) = P(Z < 1.60), which we know how to
find from the table. The number in the row with heading 1.6 and
in the column with heading 0.00 is 0.9452. Thus P(Z < 1.60) = 0.9452, so

P(Z > 1.60) = 1 - P(Z ≤ 1.60) = 1 - 0.9452 = 0.0548

Figure 5.11 "Computing a Probability for a Right Half-Line"
illustrates the ideas geometrically. Since the total area under the
curve is 1 and the area of the region to the left of 1.60 is (from
the table) 0.9452, the area of the region to the right of 1.60 must
be 1 - 0.9452 = 0.0548.


Figure 5.11

Computing a Probability for a Right Half-Line

Normal Table: area = 0.9452


b.  The minus sign in -1.02 makes no difference in the procedure;
the table is used in exactly the same way as in part (a). The
number in the intersection of the row with heading -1.0 and the
column with heading 0.02 is 0.1539. This means that
P(Z < -1.02) = P(Z ≤ -1.02) = 0.1539, hence

P(Z > -1.02) = 1 - P(Z ≤ -1.02) = 1 - 0.1539 = 0.8461
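The complement rule used in this example translates directly into code. The following sketch again assumes statistics.NormalDist from the Python standard library in place of the printed table; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable

# P(Z > z) = 1 - P(Z <= z), by the Probability Rule for Complements.
p_a = 1 - Z.cdf(1.60)    # P(Z > 1.60)
p_b = 1 - Z.cdf(-1.02)   # P(Z > -1.02)

print(round(p_a, 4), round(p_b, 4))  # 0.0548 0.8461
```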

EXAMPLE 6


Find  the  probabilities  indicated. 

a.  P(0.5 < Z < 1.57).

b.  P  (-2.55  <  Z  <  0.09)  . 

Solution: 

a.  Figure 5.12 "Computing a Probability for an Interval of Finite
Length" illustrates the ideas involved for intervals of this type.
First look up the areas in the table that correspond to the
numbers 0.5 (which we think of as 0.50 to use the table) and 1.57.
We obtain 0.6915 and 0.9418, respectively. From the figure it is
apparent that we must take the difference of these two numbers
to obtain the probability desired. In symbols,

P(0.5 < Z < 1.57) = P(Z < 1.57) - P(Z < 0.50) = 0.9418 - 0.6915 = 0.2503

Figure  5.12 

Computing  a  Probability  for  an  Interval  of  Finite  Length 

Normal  Table:  area  =  0.9418 


b.  The procedure for finding the probability that Z takes a value in
a finite interval whose endpoints have opposite signs is exactly
the same procedure used in part (a), and is illustrated in Figure
5.13 "Computing a Probability for an Interval of Finite Length".
In symbols the computation is

P(-2.55 < Z < 0.09) = P(Z < 0.09) - P(Z < -2.55)
                    = 0.5359 - 0.0054 = 0.5305


Figure  5.13 

Computing  a  Probability  for  an  Interval  of  Finite  Length 
Normal  Table:  area  =  0.5359 


Normal  Table: 
area  =  0.0054 
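The difference of two cumulative probabilities used in this example can be wrapped in a small helper. This is a sketch under the assumption that statistics.NormalDist from the Python standard library stands in for the table; the function name interval_prob is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable


def interval_prob(a, b):
    """P(a < Z < b) as the difference of two cumulative probabilities."""
    return Z.cdf(b) - Z.cdf(a)


p_a = interval_prob(0.5, 1.57)    # part (a)
p_b = interval_prob(-2.55, 0.09)  # part (b)
print(round(p_a, 4), round(p_b, 4))  # 0.2503 0.5305
```

The same helper works whether the endpoints have the same sign or opposite signs, which mirrors the remark in part (b).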


The  next  example  shows  what  to  do  if  the  value  of  Z  that  we  want  to  look  up  in  the 
table  is  not  present  there. 

EXAMPLE 7


Find  the  probabilities  indicated. 

a.  P(1.13 < Z < 4.16).

b.  P  (-5.22  <  Z  <  2.15)  . 

Solution: 


a.  We  attempt  to  compute  the  probability  exactly  as  in  Note  5.20 
"Example  6"  by  looking  up  the  numbers  1.13  and  4.16  in  the 
table.  We  obtain  the  value  0.8708  for  the  area  of  the  region  under 
the density curve to the left of 1.13 without any problem, but when
we  go  to  look  up  the  number  4.16  in  the  table,  it  is  not  there.  We 
can  see  from  the  last  row  of  numbers  in  the  table  that  the  area  to 
the  left  of  4.16  must  be  so  close  to  1  that  to  four  decimal  places  it 
rounds  to  1.0000.  Therefore 

P  (1.13  <  Z  <  4.16)  =  1.0000  -  0.8708  =  0.1292 

b.  Similarly,  here  we  can  read  directly  from  the  table  that  the  area 
under  the  density  curve  and  to  the  left  of  2.15  is  0.9842,  but  -5.22 
is  too  far  to  the  left  on  the  number  line  to  be  in  the  table.  We  can 
see  from  the  first  line  of  the  table  that  the  area  to  the  left  of -5.22 
must  be  so  close  to  0  that  to  four  decimal  places  it  rounds  to 
0.0000.  Therefore 

P  (-5.22  <  Z  <  2.15)  =  0.9842  -  0.0000  =  0.9842 
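The claim that the cumulative probabilities at 4.16 and -5.22 round to 1.0000 and 0.0000 can be checked directly. The sketch below assumes statistics.NormalDist from the Python standard library, which is not limited to the range of z-values printed in the table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable

# Cumulative probabilities for z-values beyond the range of the table.
far_right = Z.cdf(4.16)
far_left = Z.cdf(-5.22)
print(round(far_right, 4), round(far_left, 4))  # 1.0 0.0

# The two probabilities of Example 7.
p_a = Z.cdf(4.16) - Z.cdf(1.13)
p_b = Z.cdf(2.15) - Z.cdf(-5.22)
print(round(p_a, 4), round(p_b, 4))  # 0.1292 0.9842
```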


The  final  example  of  this  section  explains  the  origin  of  the  proportions  given  in  the 
Empirical  Rule. 

EXAMPLE 8


Find  the  probabilities  indicated. 

a.  P(-1 < Z < 1).

b.  P(-2 < Z < 2).

c.  P(-3 < Z < 3).

Solution: 

a.  Using  the  table  as  was  done  in  Note  5.20  "Example  6"(b)  we 
obtain 

P(-1 < Z < 1) = 0.8413 - 0.1587 = 0.6826

Since  Z  has  mean  0  and  standard  deviation  1,  for  Z  to  take  a  value 
between  -1  and  1  means  that  Z  takes  a  value  that  is  within  one 
standard  deviation  of  the  mean.  Our  computation  shows  that  the 
probability  that  this  happens  is  about  0.68,  the  proportion  given 
by  the  Empirical  Rule  for  histograms  that  are  mound  shaped  and 
symmetrical,  like  the  bell  curve. 

b.  Using  the  table  in  the  same  way, 

P  (-2  <  Z  <  2)  =  0.9772  -  0.0228  =  0.9544 

This  corresponds  to  the  proportion  0.95  for  data  within  two 
standard  deviations  of  the  mean. 

c.  Similarly, 

P(-3 < Z < 3) = 0.9987 - 0.0013 = 0.9974

which  corresponds  to  the  proportion  0.997  for  data  within  three 
standard  deviations  of  the  mean. 
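The three Empirical Rule proportions can be reproduced in a few lines. This sketch assumes statistics.NormalDist from the Python standard library; the helper name within_k is hypothetical. Because the software does not round the two cumulative probabilities to four places before subtracting, its results (about 0.6827, 0.9545, and 0.9973) can differ from the table-based answers above by one unit in the fourth decimal place.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable


def within_k(k):
    """P(-k < Z < k): probability of falling within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within_k(k), 4))
```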

KEY TAKEAWAYS


•  A standard normal random variable Z is a normally distributed random
variable with mean μ = 0 and standard deviation σ = 1.

•  Probabilities  for  a  standard  normal  random  variable  are  computed  using 
Figure  12.2  "Cumulative  Normal  Probability". 

EXERCISES


1.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  probability 
indicated. 

a.  P(Z<  -1.72) 

b.  P(Z<  2.05) 

c.  P(Z<0) 

d.  P(Z> -2.11) 

e.  P(Z>  1.63) 

f.  P(Z>  2.36) 

2.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  probability 
indicated. 

a.  P(Z< -1.17) 

b.  P(Z< -0.05) 

c.  P(Z<  0.66) 

d.  P(Z> -2.43) 

e.  P(Z> -1.00) 

f.  P(Z>  2.19) 

3.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  probability 
indicated. 

a.  P(-2.15  <  Z  <  -1.09) 

b.  P(-0.93  <  Z  <  0.55) 

c.  P(0.68 < Z < 2.11)

4.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  probability 
indicated. 

a.  P(-1.99  <  Z  <  -1.03) 

b.  P(-0.87  <  Z  <  1.58) 

c.  P(0.33  <  Z  <  0.96) 

5.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  probability 
indicated. 

a.  P(-4.22  <  Z  <  -1.39) 

b.  P(-1.37 < Z < 5.11)

c.  P(Z< -4.31) 

d.  P(Z<  5.02) 

6.  Use Figure 12.2 "Cumulative Normal Probability" to find the probability
indicated. 

a.  P(Z > -5.31)

b.  P(-4.08  <  Z  <  0.58) 

c.  P(Z<  -6.16) 

d.  P(-0.51  <  Z  <  5.63) 

7.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  first  probability 
listed.  Find  the  second  probability  without  referring  to  the  table,  but  using  the 
symmetry  of  the  standard  normal  density  curve  instead.  Sketch  the  density 
curve  with  relevant  regions  shaded  to  illustrate  the  computation. 

a.  P(Z  <  -1.08),  P(Z  >  1.08) 

b.  P(Z  <  -0.36),  P(Z  >  0.36) 

c.  P(Z<  1.25),  P(Z> -1.25) 

d.  P(Z  <  2.03),  P(Z  >  -2.03) 

8.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  first  probability 
listed.  Find  the  second  probability  without  referring  to  the  table,  but  using  the 
symmetry  of  the  standard  normal  density  curve  instead.  Sketch  the  density 
curve  with  relevant  regions  shaded  to  illustrate  the  computation. 

a.  P(Z<  -2.11),  P(Z>  2.11) 

b.  P(Z  <  -0.88),  P(Z  >  0.88) 

c.  P(Z < 2.44), P(Z > -2.44)

d.  P(Z < 3.07), P(Z > -3.07)

9.  The probability that a standard normal random variable Z takes a value in the
union of intervals (-∞, -a] ∪ [a, ∞), which arises in applications, will be denoted
P(Z ≤ -a or Z ≥ a). Use Figure 12.2 "Cumulative Normal Probability" to find the
following probabilities of this type. Sketch the density curve with relevant
regions shaded to illustrate the computation. Because of the symmetry of the
standard normal density curve you need to use Figure 12.2 "Cumulative
Normal Probability" only one time for each part.

a.  P(Z ≤ -1.29 or Z ≥ 1.29)

b.  P(Z ≤ -2.33 or Z ≥ 2.33)

c.  P(Z ≤ -1.96 or Z ≥ 1.96)

d.  P(Z ≤ -3.09 or Z ≥ 3.09)

10.  The probability that a standard normal random variable Z takes a value in the
union of intervals (-∞, -a] ∪ [a, ∞), which arises in applications, will be denoted
P(Z ≤ -a or Z ≥ a). Use Figure 12.2 "Cumulative Normal Probability" to find the
following probabilities of this type. Sketch the density curve with relevant
regions shaded to illustrate the computation. Because of the symmetry of the
standard normal density curve you need to use Figure 12.2 "Cumulative
Normal Probability" only one time for each part.

a.  P(Z ≤ -2.58 or Z ≥ 2.58)

b.  P(Z ≤ -2.81 or Z ≥ 2.81)

c.  P(Z ≤ -1.65 or Z ≥ 1.65)

d.  P(Z ≤ -2.43 or Z ≥ 2.43)


ANSWERS 

1.  a.  0.0427
    b.  0.9798
    c.  0.5
    d.  0.9826
    e.  0.0516
    f.  0.0091

3.  a.  0.1221
    b.  0.5326
    c.  0.2309

5.  a.  0.0823
    b.  0.9147
    c.  0.0000
    d.  1.0000

7.  a.  0.1401, 0.1401
    b.  0.3594, 0.3594
    c.  0.8944, 0.8944
    d.  0.9788, 0.9788

9.  a.  0.1970
    b.  0.0198
    c.  0.0500
    d.  0.0020


5.3  Probability Computations for General Normal Random Variables


LEARNING  OBJECTIVE 

1.  To  learn  how  to  compute  probabilities  related  to  any  normal  random 
variable. 


If X is any normally distributed random variable, then Figure 12.2
"Cumulative Normal Probability" can also be used to compute a probability of the
form P(a < X < b) by means of the following equality.


If X is a normally distributed random variable with mean μ and standard
deviation σ, then

P(a < X < b) = P((a - μ)/σ < Z < (b - μ)/σ)

where Z denotes a standard normal random variable. Here a can be any decimal
number or -∞; b can be any decimal number or ∞.


The new endpoints (a - μ)/σ and (b - μ)/σ are the z-scores of a and b as
defined in Section 2.4.2 in Chapter 2 "Descriptive Statistics".


Figure  5.14  "Probability  for  an  Interval  of  Finite  Length"  illustrates  the  meaning  of 
the  equality  geometrically:  the  two  shaded  regions,  one  under  the  density  curve  for 
X  and  the  other  under  the  density  curve  for  Z,  have  the  same  area.  Instead  of 
drawing  both  bell  curves,  though,  we  will  always  draw  a  single  generic  bell-shaped 
curve  with  both  an  x-axis  and  a  z-axis  below  it. 

Figure 5.14  Probability for an Interval of Finite Length

Same  shaded  area 

EXAMPLE 9


Let X be a normal random variable with mean μ = 10 and standard deviation
σ = 2.5. Compute the following probabilities.

a.  P(X < 14).

b.  P(8 < X < 14).


Solution: 


a.  See  Figure  5.15  "Probability  Computation  for  a  General  Normal 
Random  Variable". 


P(X < 14) = P(Z < (14 - μ)/σ)
          = P(Z < (14 - 10)/2.5)
          = P(Z < 1.60)
          = 0.9452


Figure  5.15 

Probability  Computation  for  a  General  Normal  Random  Variable 

Normal  Table:  area  =  0.9452 



b.  See Figure 5.16 "Probability Computation for a General Normal
Random Variable".


P(8 < X < 14) = P((8 - 10)/2.5 < Z < (14 - 10)/2.5)
              = P(-0.80 < Z < 1.60)
              = 0.9452 - 0.2119
              = 0.7333


Figure  5.16 

Probability  Computation  for  a  General  Normal  Random  Variable 

Normal  Table:  area  =  0.9452 
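The standardization step in this example can be expressed as a short function. The following is a sketch, assuming Python's standard library class statistics.NormalDist for the standard normal cumulative probabilities; the function name prob_between is hypothetical. Passing -math.inf as the left endpoint covers probabilities of the form P(X < b).

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable


def prob_between(a, b, mu, sigma):
    """P(a < X < b) for normal X, computed by converting a and b to z-scores."""
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return Z.cdf(z_b) - Z.cdf(z_a)


# Example 9: mean 10, standard deviation 2.5.
p_a = prob_between(-math.inf, 14, mu=10, sigma=2.5)  # P(X < 14)
p_b = prob_between(8, 14, mu=10, sigma=2.5)          # P(8 < X < 14)
print(round(p_a, 4), round(p_b, 4))  # 0.9452 0.7333
```

Computing the same probability directly from NormalDist(10, 2.5) gives the same value, which is what the equality between the two shaded areas asserts.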



EXAMPLE 10


The  lifetimes  of  the  tread  of  a  certain  automobile  tire  are  normally 
distributed  with  mean  37,500  miles  and  standard  deviation  4,500  miles.  Find 
the  probability  that  the  tread  life  of  a  randomly  selected  tire  will  be  between 
30,000  and  40,000  miles. 

Solution: 

Let X denote the tread life of a randomly selected tire. To make the numbers
easier to work with we will choose thousands of miles as the units. Thus μ =
37.5, σ = 4.5, and the problem is to compute P(30 < X < 40). Figure
5.17 "Probability Computation for Tire Tread Wear" illustrates the following
computation:

Figure  5.17 

Probability  Computation  for  Tire  Tread  Wear 

Normal  Table:  area  =  0.7123 


P(30 < X < 40) = P((30 - μ)/σ < Z < (40 - μ)/σ)
               = P((30 - 37.5)/4.5 < Z < (40 - 37.5)/4.5)
               = P(-1.67 < Z < 0.56)
               = 0.7123 - 0.0475
               = 0.6648


Note  that  the  two  z-scores  were  rounded  to  two  decimal  places  in  order  to 
use  Figure  12.2  "Cumulative  Normal  Probability". 
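The effect of rounding the z-scores can be seen by computing the probability both ways. This is a sketch assuming statistics.NormalDist from the Python standard library; the variable names p_table and p_exact are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable
mu, sigma = 37.5, 4.5  # tread life in thousands of miles

# z-scores rounded to two decimal places, as required to use the table.
z_low = round((30 - mu) / sigma, 2)   # -1.67
z_high = round((40 - mu) / sigma, 2)  # 0.56
p_table = Z.cdf(z_high) - Z.cdf(z_low)

# The same probability with unrounded z-scores.
p_exact = Z.cdf((40 - mu) / sigma) - Z.cdf((30 - mu) / sigma)

print(round(p_table, 4), round(p_exact, 4))
```

The rounded z-scores reproduce the table answer 0.6648, while the unrounded computation gives a value near 0.663. The discrepancy comes only from rounding the z-scores.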




EXAMPLE  11 


Scores  on  a  standardized  college  entrance  examination  (CEE)  are  normally 
distributed  with  mean  510  and  standard  deviation  60.  A  selective  university 
considers for admission only applicants with CEE scores over 650. Find the
percentage of all individuals who took the CEE who meet the university's CEE
requirement for consideration for admission.

Solution: 

Let  X  denote  the  score  made  on  the  CEE  by  a  randomly  selected  individual. 
Then  X  is  normally  distributed  with  mean  510  and  standard  deviation  60. 

The  probability  that  X  lie  in  a  particular  interval  is  the  same  as  the 
proportion  of  all  exam  scores  that  lie  in  that  interval.  Thus  the  solution  to 
the  problem  is  P(X  >  650),  expressed  as  a  percentage.  Figure  5.18  "Probability 
Computation  for  Exam  Scores"  illustrates  the  following  computation: 

Figure  5.18 

Probability  Computation  for  Exam  Scores 

Normal  Table:  area  =  0.9901 


P(X > 650) = P(Z > (650 - μ)/σ)
           = P(Z > (650 - 510)/60)
           = P(Z > 2.33)
           = 1 - 0.9901
           = 0.0099


The  proportion  of  all  CEE  scores  that  exceed  650  is  0.0099,  hence  0.99%  or 
about  1%  do. 
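The same computation can be sketched in code, combining the z-score conversion with the complement rule. It assumes statistics.NormalDist from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal random variable
mu, sigma = 510, 60  # CEE score distribution from Example 11

# z-score of 650, rounded to two decimal places as when using the table.
z = round((650 - mu) / sigma, 2)  # 2.33
p_table = 1 - Z.cdf(z)

# Without rounding the z-score the tail probability is slightly smaller.
p_exact = 1 - Z.cdf((650 - mu) / sigma)

print(f"{p_table:.4f} {p_exact:.4f}")
print(f"About {100 * p_table:.2f}% of scores exceed 650.")
```

Either way, roughly 1% of those who took the examination meet the requirement.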


KEY  TAKEAWAY 


•  Probabilities  for  a  general  normal  random  variable  are  computed  using 
Figure  12.2  "Cumulative  Normal  Probability"  after  converting  x-values 
to  z-scores. 

1.  X is a normally distributed random variable with mean 57 and standard
deviation  6.  Find  the  probability  indicated. 

a.  P(X<  59.5) 

b.  P(X<  46.2) 

c.  P(X>  52.2) 

d.  P(X>  70) 

2.  X  is  a  normally  distributed  random  variable  with  mean  -25  and  standard 
deviation  4.  Find  the  probability  indicated. 

a.  P(X < -27.2)

b.  P(X  <  -14.8) 

c.  P(X > -33.1)

d.  P(X  >  -16.5) 

3.  X  is  a  normally  distributed  random  variable  with  mean  112  and  standard 
deviation  15.  Find  the  probability  indicated. 

a.  P(100  <X  <  125) 

b.  P(91  <X  <  107) 

c.  P(118 < X < 160)

4.  X  is  a  normally  distributed  random  variable  with  mean  72  and  standard 
deviation  22.  Find  the  probability  indicated. 

a.  P  (78  <  X  <  127) 

b.  P(60<X<  90) 

c.  P  (49  <  X  <  71) 

5.  X  is  a  normally  distributed  random  variable  with  mean  500  and  standard 
deviation  25.  Find  the  probability  indicated. 

a.  P(X  <  400) 

b.  P  (466  <  X  <  625) 

6.  X  is  a  normally  distributed  random  variable  with  mean  0  and  standard 
deviation  0.75.  Find  the  probability  indicated. 

a.  P(-4.02  <  X  <  3.82) 

b.  P(X > 4.11)

7.  X  is  a  normally  distributed  random  variable  with  mean  15  and  standard 
deviation  1.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  first 
probability  listed.  Find  the  second  probability  using  the  symmetry  of  the 
density  curve.  Sketch  the  density  curve  with  relevant  regions  shaded  to 
illustrate  the  computation. 

a.  P(X  <  12),  P(X  >  18) 

b.  P(X  <  14),  P(X  >  16) 

c.  P(X<  11.25),  P(X>  18.75) 

d.  P(X  <  12.67),  P(X  >  17.33) 

8.  X  is  a  normally  distributed  random  variable  with  mean  100  and  standard 
deviation  10.  Use  Figure  12.2  "Cumulative  Normal  Probability"  to  find  the  first 
probability  listed.  Find  the  second  probability  using  the  symmetry  of  the 
density  curve.  Sketch  the  density  curve  with  relevant  regions  shaded  to 
illustrate  the  computation. 

a.  P(X  <  80),  P(X  >  120) 

b.  P(X  <  75),  P(X  >  125) 

c.  P(X < 84.55), P(X > 115.45)

d.  P(X  <  77.42),  P(X  >  122.58) 

9.  X  is  a  normally  distributed  random  variable  with  mean  67  and  standard 
deviation  13.  The  probability  that  X  takes  a  value  in  the  union  of  intervals 
(-∞, 67 - a] ∪ [67 + a, ∞) will be denoted
P(X ≤ 67 - a or X ≥ 67 + a). Use Figure 12.2 "Cumulative Normal
Probability" to find the following probabilities of this type. Sketch the density
curve with relevant regions shaded to illustrate the computation. Because of
the symmetry of the density curve you need to use Figure 12.2 "Cumulative
Normal Probability" only one time for each part.

a.  P(X ≤ 57 or X ≥ 77)

b.  P(X ≤ 47 or X ≥ 87)

c.  P(X ≤ 49 or X ≥ 85)

d.  P(X ≤ 37 or X ≥ 97)

10.  X  is  a  normally  distributed  random  variable  with  mean  288  and  standard 
deviation  6.  The  probability  that  X  takes  a  value  in  the  union  of  intervals 
(-∞, 288 - a] ∪ [288 + a, ∞) will be denoted
P(X ≤ 288 - a or X ≥ 288 + a). Use Figure 12.2 "Cumulative
Normal Probability" to find the following probabilities of this type. Sketch the
density curve with relevant regions shaded to illustrate the computation.
Because of the symmetry of the density curve you need to use Figure 12.2
"Cumulative Normal Probability" only one time for each part.

a.  P(X ≤ 278 or X ≥ 298)

b.  P(X ≤ 268 or X ≥ 308)

c.  P(X ≤ 273 or X ≥ 303)

d.  P(X ≤ 280 or X ≥ 296)


APPLICATIONS 


11.  The  amount  X  of  beverage  in  a  can  labeled  12  ounces  is  normally  distributed 
with  mean  12.1  ounces  and  standard  deviation  0.05  ounce.  A  can  is  selected  at 
random. 

a.  Find  the  probability  that  the  can  contains  at  least  12  ounces. 

b.  Find  the  probability  that  the  can  contains  between  11.9  and  12.1  ounces. 

12.  The  length  of  gestation  for  swine  is  normally  distributed  with  mean  114  days 
and  standard  deviation  0.75  day.  Find  the  probability  that  a  litter  will  be  born 
within  one  day  of  the  mean  of  114. 

13.  The  systolic  blood  pressure  X  of  adults  in  a  region  is  normally  distributed  with 
mean  112  mm  Hg  and  standard  deviation  15  mm  Hg.  A  person  is  considered 
“prehypertensive”  if  his  systolic  blood  pressure  is  between  120  and  130  mm 
Hg.  Find  the  probability  that  the  blood  pressure  of  a  randomly  selected  person 
is  prehypertensive. 

14.  Heights  X  of  adult  women  are  normally  distributed  with  mean  63.7  inches  and 
standard  deviation  2.71  inches.  Romeo,  who  is  69.25  inches  tall,  wishes  to  date 
only  women  who  are  shorter  than  he  but  within  4  inches  of  his  height.  Find  the 
probability  that  the  next  woman  he  meets  will  have  such  a  height. 

15.  Heights  X  of  adult  men  are  normally  distributed  with  mean  69.1  inches  and 
standard  deviation  2.92  inches.  Juliet,  who  is  63.25  inches  tall,  wishes  to  date 
only  men  who  are  taller  than  she  but  within  6  inches  of  her  height.  Find  the 
probability  that  the  next  man  she  meets  will  have  such  a  height. 

16.  A  regulation  hockey  puck  must  weigh  between  5.5  and  6  ounces.  The  weights  X 
of  pucks  made  by  a  particular  process  are  normally  distributed  with  mean  5.75 
ounces  and  standard  deviation  0.11  ounce.  Find  the  probability  that  a  puck 
made  by  this  process  will  meet  the  weight  standard. 

17.  A regulation golf ball may not weigh more than 1.620 ounces. The weights X of
golf  balls  made  by  a  particular  process  are  normally  distributed  with  mean 
1.361  ounces  and  standard  deviation  0.09  ounce.  Find  the  probability  that  a 
golf  ball  made  by  this  process  will  meet  the  weight  standard. 

18.  The  length  of  time  that  the  battery  in  Hippolyta's  cell  phone  will  hold  enough 
charge  to  operate  acceptably  is  normally  distributed  with  mean  25.6  hours  and 
standard  deviation  0.32  hour.  Hippolyta  forgot  to  charge  her  phone  yesterday, 
so  that  at  the  moment  she  first  wishes  to  use  it  today  it  has  been  26  hours  18 
minutes  since  the  phone  was  last  fully  charged.  Find  the  probability  that  the 
phone  will  operate  properly. 

19.  The  amount  of  non-mortgage  debt  per  household  for  households  in  a 
particular  income  bracket  in  one  part  of  the  country  is  normally  distributed 
with  mean  $28,350  and  standard  deviation  $3,425.  Find  the  probability  that  a 
randomly selected such household has between $20,000 and $30,000 in
non-mortgage debt.

20.  Birth  weights  of  full-term  babies  in  a  certain  region  are  normally  distributed 
with  mean  7.125  lb  and  standard  deviation  1.290  lb.  Find  the  probability  that  a 
randomly  selected  newborn  will  weigh  less  than  5.5  lb,  the  historic  definition 
of  prematurity. 

21.  The  distance  from  the  seat  back  to  the  front  of  the  knees  of  seated  adult  males 
is  normally  distributed  with  mean  23.8  inches  and  standard  deviation  1.22 
inches.  The  distance  from  the  seat  back  to  the  back  of  the  next  seat  forward  in 
all  seats  on  aircraft  flown  by  a  budget  airline  is  26  inches.  Find  the  proportion 
of  adult  men  flying  with  this  airline  whose  knees  will  touch  the  back  of  the  seat 
in  front  of  them. 

22.  The  distance  from  the  seat  to  the  top  of  the  head  of  seated  adult  males  is 
normally  distributed  with  mean  36.5  inches  and  standard  deviation  1.39 
inches.  The  distance  from  the  seat  to  the  roof  of  a  particular  make  and  model 
car  is  40.5  inches.  Find  the  proportion  of  adult  men  who  when  sitting  in  this 
car  will  have  at  least  one  inch  of  headroom  (distance  from  the  top  of  the  head 
to  the  roof). 


ADDITIONAL  EXERCISES 


23.  The  useful  life  of  a  particular  make  and  type  of  automotive  tire  is  normally 
distributed  with  mean  57,500  miles  and  standard  deviation  950  miles. 

a.  Find  the  probability  that  such  a  tire  will  have  a  useful  life  of  between 
57,000  and  58,000  miles. 

b. Hamlet buys four such tires. Assuming that their lifetimes are
independent, find the probability that all four will last between 57,000 and
58,000 miles. (If so, the best tire will have no more than 1,000 miles left on
it when the first tire fails.) Hint: There is a binomial random variable here,
whose value of p comes from part (a).

24.  A  machine  produces  large  fasteners  whose  length  must  be  within  0.5  inch  of  22 
inches.  The  lengths  are  normally  distributed  with  mean  22.0  inches  and 
standard  deviation  0.17  inch. 

a.  Find  the  probability  that  a  randomly  selected  fastener  produced  by  the 
machine  will  have  an  acceptable  length. 

b.  The  machine  produces  20  fasteners  per  hour.  The  length  of  each  one  is 
inspected.  Assuming  lengths  of  fasteners  are  independent,  find  the 
probability  that  all  20  will  have  acceptable  length.  Hint:  There  is  a  binomial 
random  variable  here,  whose  value  of  p  comes  from  part  (a). 

25.  The  lengths  of  time  taken  by  students  on  an  algebra  proficiency  exam  (if  not 
forced  to  stop  before  completing  it)  are  normally  distributed  with  mean  28 
minutes  and  standard  deviation  1.5  minutes. 

a.  Find  the  proportion  of  students  who  will  finish  the  exam  if  a  30-minute 
time  limit  is  set. 

b.  Six  students  are  taking  the  exam  today.  Find  the  probability  that  all  six 
will  finish  the  exam  within  the  30-minute  limit,  assuming  that  times  taken 
by  students  are  independent.  Hint:  There  is  a  binomial  random  variable 
here,  whose  value  of  p  comes  from  part  (a). 

26.  Heights  of  adult  men  between  18  and  34  years  of  age  are  normally  distributed 
with  mean  69.1  inches  and  standard  deviation  2.92  inches.  One  requirement  for 
enlistment  in  the  military  is  that  men  must  stand  between  60  and  80  inches 
tall. 

a. Find the probability that a randomly selected man meets the height
requirement  for  military  service. 

b.  Twenty-three  men  independently  contact  a  recruiter  this  week.  Find  the 
probability  that  all  of  them  meet  the  height  requirement.  Hint:  There  is  a 
binomial  random  variable  here,  whose  value  of  p  comes  from  part  (a). 

27.  A  regulation  hockey  puck  must  weigh  between  5.5  and  6  ounces.  In  an 
alternative manufacturing process the mean weight of pucks produced is 5.75
ounces. The weights of pucks have a normal distribution whose standard
deviation can be decreased by increasingly stringent (and expensive) controls
on the manufacturing process. Find the maximum allowable standard
deviation so that at most 0.005 of all pucks will fail to meet the weight
standard. (Hint: The distribution is symmetric and is centered at the middle of
the  interval  of  acceptable  weights.) 

28. The amount of gasoline X delivered by a metered pump when it registers 5
gallons is a normally distributed random variable. The standard deviation $\sigma$ of
X measures the precision of the pump; the smaller $\sigma$ is, the smaller the
variation from delivery to delivery. A typical standard for pumps is that when
they show that 5 gallons of fuel has been delivered the actual amount must be
between 4.97 and 5.03 gallons (which corresponds to being off by at most about
half a cup). Supposing that the mean of X is 5, find the largest that $\sigma$ can be so
that P(4.97 < X < 5.03) is 1.0000 to four decimal places when computed using
Figure 12.2 "Cumulative Normal Probability", which means that the pump is
sufficiently accurate. (Hint: The z-score of 5.03 will be the smallest value of Z so
that Figure 12.2 "Cumulative Normal Probability" gives P(Z < z) = 1.0000.)


ANSWERS


1.  a.  0.6628

b.  0.7881

c.  0.0359

d.  0.0150

3.  a.  0.5959

b.  0.2899

c.  0.3439

5.  a.  0.0000

b.  0.9131

7.  a.  0.0013,  0.0013

b.  0.1587,  0.1587

c.  0.0001,  0.0001

d.  0.0099,  0.0099

9.  a.  0.4412

b.  0.1236

c.  0.1676

d.  0.0208

11.  a.  0.9772

b.  0.5000

13.  0.1830 

15.  0.4971 

17.  0.9980 

19.  0.6771 

21.  0.0359 

23.  a.  0.4038 

b.  0.0266 

25.  a.  0.9082 

b.  0.5612 

27.  0.089 


Chapter  5  Continuous  Random  Variables 


5.4  Areas  of  Tails  of  Distributions 


LEARNING  OBJECTIVE 

1. To learn how to find, for a normal random variable X and an area a, the
value x* of X so that P(X < x*) = a or that P(X > x*) = a,
whichever is required.


The  probabilities  tabulated  in  Figure  12.2  "Cumulative  Normal  Probability"  are 
areas  of  left  tails  in  the  standard  normal  distribution. 

Tails  of  the  Standard  Normal  Distribution 


5. The region under a density
curve whose area is either
P(X < x*) or P(X > x*)
for some number x*.


At  times  it  is  important  to  be  able  to  solve  the  kind  of  problem  illustrated  by  Figure 
5.20.  We  have  a  certain  specific  area  in  mind,  in  this  case  the  area  0.0125  of  the 
shaded  region  in  the  figure,  and  we  want  to  find  the  value  z*  of  Z  that  produces  it. 
This  is  exactly  the  reverse  of  the  kind  of  problems  encountered  so  far.  Instead  of 
knowing  a  value  z*  of  Z  and  finding  a  corresponding  area,  we  know  the  area  and 
want  to  find  z*.  In  the  case  at  hand,  in  the  terminology  of  the  definition  just  above, 
we  wish  to  find  the  value  z*  that  cuts  off  a  left  tail  of  area  0.0125  in  the  standard 
normal  distribution. 


The  idea  for  solving  such  a  problem  is  fairly  simple,  although  sometimes  its 
implementation  can  be  a  bit  complicated.  In  a  nutshell,  one  reads  the  cumulative 
probability  table  for  Z  in  reverse,  looking  up  the  relevant  area  in  the  interior  of  the 
table  and  reading  off  the  value  of  Z  from  the  margins. 
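
The reverse lookup just described can also be carried out numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` is used in place of the printed table; the helper name `z_for_left_tail` is hypothetical and not part of the text.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
Z = NormalDist()

def z_for_left_tail(area):
    """Return z* such that P(Z < z*) equals the given left-tail area."""
    # inv_cdf is the inverse of the cumulative probability: area in, z out.
    return Z.inv_cdf(area)

# Left tail of area 0.0125, the case pictured in Figure 5.20.
z_star = z_for_left_tail(0.0125)
print(round(z_star, 2))  # -2.24
```

Rounding to two decimal places mirrors the precision of the table margins; the unrounded value carries more digits than the table can supply.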


Figure  5.20  Z  Value  that  Produces  a  Known  Area 


EXAMPLE 12


Find the value z* of Z as determined by Figure 5.20: the value z* that cuts
off a left tail of area 0.0125 in the standard normal distribution. In symbols,
find the number z* such that P(Z < z*) = 0.0125.

Solution: 

The  number  that  is  known,  0.0125,  is  the  area  of  a  left  tail,  and  as  already 
mentioned  the  probabilities  tabulated  in  Figure  12.2  "Cumulative  Normal 
Probability"  are  areas  of  left  tails.  Thus  to  solve  this  problem  we  need  only 
search  in  the  interior  of  Figure  12.2  "Cumulative  Normal  Probability"  for  the 
number  0.0125.  It  lies  in  the  row  with  the  heading  -2.2  and  in  the  column 
with  the  heading  0.04.  This  means  that  P(Z  <  -2.24)  =  0.0125,  hence 

z*  =  -2.24. 


EXAMPLE 13


Find the value z* of Z as determined by Figure 5.21: the value z* that cuts
off a right tail of area 0.0250 in the standard normal distribution. In symbols,
find the number z* such that P(Z > z*) = 0.0250.


Figure 5.21 Z Value that Produces a Known Area (area to the left of z*: 1 - 0.0250 = 0.9750)


Solution: 

The important distinction between this example and the previous one is that
here it is the area of a right tail that is known. In order to be able to use
Figure 12.2 "Cumulative Normal Probability" we must first find the area of
the left tail cut off by the unknown number z*. Since the total area under
the density curve is 1, that area is 1 - 0.0250 = 0.9750. This is the
number we look for in the interior of Figure 12.2 "Cumulative Normal
Probability". It lies in the row with the heading 1.9 and in the column with
the heading 0.06. Therefore z* = 1.96.
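
The conversion from a right-tail area to the complementary left-tail area can be sketched in code as follows. This assumes Python's standard-library `statistics.NormalDist` as a stand-in for the table; the function name `z_for_right_tail` is hypothetical.

```python
from statistics import NormalDist

def z_for_right_tail(area):
    """Return z* such that P(Z > z*) equals the given right-tail area."""
    # The cumulative distribution works with left tails, so first convert
    # the right-tail area to the complementary left-tail area.
    left_area = 1 - area
    return NormalDist().inv_cdf(left_area)

# Right tail of area 0.0250, as in the example above.
z_star = z_for_right_tail(0.0250)
print(round(z_star, 2))  # 1.96
```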

Definition

The value of the standard normal random variable Z that cuts off a right tail of area c is
denoted $z_c$. By symmetry, the value of Z that cuts off a left tail of area c is $-z_c$. See Figure
5.22 "The Numbers $z_c$ and $-z_c$".

Figure 5.22 The Numbers $z_c$ and $-z_c$


The  previous  two  examples  were  atypical  because  the  areas  we  were  looking  for  in 
the  interior  of  Figure  12.2  "Cumulative  Normal  Probability"  were  actually  there. 
The  following  example  illustrates  the  situation  that  is  more  common. 


EXAMPLE 14


Find $z_{.01}$ and $-z_{.01}$, the values of Z that cut off right and left tails of area
0.01 in the standard normal distribution.

Solution:

Since $-z_{.01}$ cuts off a left tail of area 0.01 and Figure 12.2 "Cumulative
Normal Probability" is a table of left tails, we look for the number 0.0100 in
the interior of the table. It is not there, but falls between the two numbers
0.0102 and 0.0099 in the row with heading -2.3. The number 0.0099 is closer
to 0.0100 than 0.0102 is, so for the hundredths place in $-z_{.01}$ we use the
heading of the column that contains 0.0099, namely, 0.03, and write
$-z_{.01} \approx -2.33$.

The answer to the second half of the problem is automatic: since
$-z_{.01} \approx -2.33$, we conclude immediately that $z_{.01} \approx 2.33$.

We could just as well have solved this problem by looking for $z_{.01}$ first, and
it is instructive to rework the problem this way. To begin with, we must first
subtract 0.01 from 1 to find the area 1 - 0.0100 = 0.9900 of the left
tail cut off by the unknown number $z_{.01}$. See Figure 5.23 "Computation of
the Number $z_{.01}$". Then we search for the area 0.9900 in Figure 12.2 "Cumulative
Normal Probability". It is not there, but falls between the numbers 0.9898
and 0.9901 in the row with heading 2.3. Since 0.9901 is closer to 0.9900 than
0.9898 is, we use the column heading above it, 0.03, to obtain the
approximation $z_{.01} \approx 2.33$. Then finally $-z_{.01} \approx -2.33$.
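
When the required area falls between two table entries, software can return the value directly. The sketch below assumes Python's standard-library `statistics.NormalDist`; the function name `z_c` is a hypothetical helper that follows the notation of the definition above.

```python
from statistics import NormalDist

def z_c(c):
    """Value of Z that cuts off a right tail of area c."""
    return NormalDist().inv_cdf(1 - c)

upper = z_c(0.01)   # cuts off a right tail of area 0.01
lower = -upper      # by symmetry, cuts off a left tail of area 0.01

print(round(upper, 4), round(lower, 4))  # 2.3263 -2.3263
print(round(upper, 2), round(lower, 2))  # 2.33 -2.33
```

To two decimal places the computed values agree with the table-based approximation; the extra digits show how far the nearest table entry is from the exact value.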

Figure 5.23 Computation of the Number $z_{.01}$ (area to the left of $z_{.01}$: 1 - 0.01 = 0.99)


Tails  of  General  Normal  Distributions 

The problem of finding the value x* of a general normally distributed random
variable X that cuts off a tail of a specified area also arises. This problem may be
solved in two steps.


Suppose X is a normally distributed random variable with mean $\mu$ and standard
deviation $\sigma$. To find the value x* of X that cuts off a left or right tail of area c in
the distribution of X:

1. find the value z* of Z that cuts off a left or right tail of area c in the
standard normal distribution;

2. z* is the z-score of x*; compute x* using the destandardization
formula

$$x^* = \mu + z^* \sigma$$


In  short,  solve  the  corresponding  problem  for  the  standard  normal  distribution, 
thereby  obtaining  the  z-score  of  x*,  then  destandardize  to  obtain  x*. 
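
The two-step procedure can be sketched as a small function. This is an illustration only, assuming Python's standard-library `statistics.NormalDist` in place of the table; the function name `x_for_tail` and its `tail` parameter are hypothetical. The numbers used are those of Example 15 below.

```python
from statistics import NormalDist

def x_for_tail(mu, sigma, area, tail="left"):
    """Return x* that cuts off a left or right tail of the given area in a
    normal distribution with mean mu and standard deviation sigma."""
    # Step 1: solve the corresponding problem for the standard normal Z.
    left_area = area if tail == "left" else 1 - area
    z_star = NormalDist().inv_cdf(left_area)
    # Step 2: destandardize, x* = mu + z* * sigma.
    return mu + z_star * sigma

# Mean 10, standard deviation 2.5, left tail of area 0.9332.
x_star = x_for_tail(10, 2.5, 0.9332)
print(round(x_star, 2))  # 13.75
```

Passing `tail="right"` handles a right-tail area by the same complement step used with the table.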

EXAMPLE 15


Find x* such that P(X < x*) = 0.9332, where X is a normal random
variable with mean $\mu = 10$ and standard deviation $\sigma = 2.5$.

Solution:

All the ideas for the solution are illustrated in Figure 5.24 "Tail of a Normally
Distributed Random Variable". Since 0.9332 is the area of a left tail, we can
find z* simply by looking for 0.9332 in the interior of Figure 12.2
"Cumulative Normal Probability". It is in the row and column with headings
1.5 and 0.00, hence z* = 1.50. Thus x* is 1.50 standard deviations above
the mean, so

$$x^* = \mu + z^* \sigma = 10 + 1.50 \cdot 2.5 = 13.75.$$


Figure 5.24 Tail of a Normally Distributed Random Variable (left tail of area 0.9332)


EXAMPLE 16


Find x* such that P(X > x*) = 0.65, where X is a normal random
variable with mean $\mu = 175$ and standard deviation $\sigma = 12$.

Solution:

The situation is illustrated in Figure 5.25 "Tail of a Normally Distributed
Random Variable". Since 0.65 is the area of a right tail, we first subtract it
from 1 to obtain 1 - 0.65 = 0.35, the area of the complementary left
tail. We find z* by looking for 0.3500 in the interior of Figure 12.2
"Cumulative Normal Probability". It is not present, but lies between table
entries 0.3520 and 0.3483. The entry 0.3483 with row and column headings
-0.3 and 0.09 is closer to 0.3500 than the other entry is, so $z^* \approx -0.39$.
Thus x* is 0.39 standard deviations below the mean, so

$$x^* = \mu + z^* \sigma = 175 + (-0.39) \cdot 12 = 170.32$$


Figure 5.25 Tail of a Normally Distributed Random Variable (right tail of area 0.65; complementary left tail of area 1 - 0.65 = 0.35, with z* to the left of 0)


EXAMPLE 17


Scores on a standardized college entrance examination (CEE) are normally
distributed with mean 510 and standard deviation 60. A selective university
decides to give serious consideration for admission to applicants whose CEE
scores are in the top 5% of all CEE scores. Find the minimum score that meets
this criterion for serious consideration for admission.

Solution:

Let X denote the score made on the CEE by a randomly selected individual.
Then X is normally distributed with mean 510 and standard deviation 60.

The probability that X lies in a particular interval is the same as the
proportion of all exam scores that lie in that interval. Thus the minimum
score that is in the top 5% of all CEE scores is the score x* that cuts off a right tail
in the distribution of X of area 0.05 (5% expressed as a proportion). See
Figure 5.26 "Tail of a Normally Distributed Random Variable".

Figure 5.26 Tail of a Normally Distributed Random Variable (area to the left of x*: 1 - 0.05 = 0.95)


Since 0.0500 is the area of a right tail, we first subtract it from 1 to obtain
1 - 0.0500 = 0.9500, the area of the complementary left tail. We find
$z^* = z_{.05}$ by looking for 0.9500 in the interior of Figure 12.2 "Cumulative
Normal Probability". It is not present, and lies exactly half-way between the
two nearest entries that are, 0.9495 and 0.9505. In the case of a tie like this,
we will always average the values of Z corresponding to the two table
entries, obtaining here the value z* = 1.645. Using this value, we
conclude that x* is 1.645 standard deviations above the mean, so

$$x^* = \mu + z^* \sigma = 510 + 1.645 \cdot 60 = 608.7$$
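
As a check on the table-based answer, the cutoff can be computed directly. The sketch below assumes Python's standard-library `statistics.NormalDist`; the variable names `scores` and `cutoff` are illustrative only.

```python
from statistics import NormalDist

# CEE scores: normal with mean 510 and standard deviation 60.
scores = NormalDist(mu=510, sigma=60)

# The top 5% is a right tail of area 0.05, i.e. a left tail of area 0.95.
cutoff = scores.inv_cdf(1 - 0.05)
print(round(cutoff, 1))  # 608.7
```

Here the general normal distribution is inverted in one call, which combines the standard normal lookup and the destandardization step.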

EXAMPLE 18


All boys at a military school must run a fixed course as fast as they can as
part of a physical examination. Finishing times are normally distributed
with mean 29 minutes and standard deviation 2 minutes. The middle 75% of
all finishing times are classified as “average.” Find the range of times that
are average finishing times by this definition.

Solution:

Let X denote the finish time of a randomly selected boy. Then X is normally
distributed with mean 29 and standard deviation 2. The probability that X lies
in a particular interval is the same as the proportion of all finish times that
lie in that interval. Thus the situation is as shown in Figure 5.27
"Distribution of Times to Run a Course". Because the area in the middle
corresponding to “average” times is 0.75, the areas of the two tails add up to
1 - 0.75 = 0.25 in all. By the symmetry of the density curve each tail must
have half of this total, or area 0.125 each. Thus the fastest time that is
“average” has z-score $-z_{.125}$, which by Figure 12.2 "Cumulative Normal
Probability" is -1.15, and the slowest time that is “average” has z-score
$z_{.125} = 1.15$. The fastest and slowest times that are still considered
average are

$$x_{\text{fast}} = \mu + (-z_{.125})\,\sigma = 29 + (-1.15) \cdot 2 = 26.7$$

and

$$x_{\text{slow}} = \mu + z_{.125}\,\sigma = 29 + (1.15) \cdot 2 = 31.3$$

Figure 5.27 Distribution of Times to Run a Course (middle area 0.75 between x_fast and x_slow, centered at 29; the corresponding z-scores are $-z_{.125}$, 0, and $z_{.125}$)


A boy has an average finishing time if he runs the course with a time
between 26.7 and 31.3 minutes, or equivalently between 26 minutes 42
seconds and 31 minutes 18 seconds.
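
The "middle proportion" computation generalizes to any normal distribution. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the function name `middle_interval` is hypothetical.

```python
from statistics import NormalDist

def middle_interval(mu, sigma, proportion):
    """Endpoints of the interval, symmetric about mu, that contains the
    given middle proportion of a normal distribution."""
    # The two tails share the leftover area equally, by symmetry.
    tail = (1 - proportion) / 2
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

# Finishing times: mean 29 minutes, standard deviation 2, middle 75%.
fast, slow = middle_interval(29, 2, 0.75)
print(round(fast, 1), round(slow, 1))  # 26.7 31.3
```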


KEY  TAKEAWAYS 


• The problem of finding the number z* so that the probability
P(Z < z*) is a specified value c is solved by looking for the number c
in the interior of Figure 12.2 "Cumulative Normal Probability" and
reading z* from the margins.

• The problem of finding the number z* so that the probability
P(Z > z*) is a specified value c is solved by looking for the
complementary probability 1 - c in the interior of Figure 12.2
"Cumulative Normal Probability" and reading z* from the margins.

• For a normal random variable X with mean $\mu$ and standard deviation $\sigma$,
the problem of finding the number x* so that P(X < x*) is a
specified value c (or so that P(X > x*) is a specified value c) is solved
in two steps: (1) solve the corresponding problem for Z with the same
value of c, thereby obtaining the z-score, z*, of x*; (2) find x* using
$x^* = \mu + z^* \sigma$.

• The value of Z that cuts off a right tail of area c in the standard normal
distribution is denoted $z_c$.


1. Find the value of z* that yields the probability shown.

a.  P(Z  <  z*)  =  0.0075 

b.  P(Z  <  z*)  =  0.9850 

c.  P(Z  >  z*)  =  0.8997 

d. P(Z > z*) = 0.0110

2. Find the value of z* that yields the probability shown.

a.  P  (Z  <  z*)  =  0.3300 

b.  P(Z  <  z*)  =  0.9901 

c.  P  (Z  >  z*)  =  0.0055 

d.  P(Z  >  z*)  =  0.7995 

3. Find the value of z* that yields the probability shown.

a. P(Z < z*) = 0.1500

b.  P(Z  <  z*)  =  0.7500 

c.  P(Z  >  z*)  =  0.3333 

d.  P  (Z  >  z*)  =  0.8000 

4. Find the value of z* that yields the probability shown.

a.  P  (Z  <  z*)  =  0.2200 

b.  P  (Z  <  z*)  =  0.6000 

c.  P(Z  >  z*)  =  0.0750 

d.  P  (Z  >  z*)  =  0.8200 

5. Find the indicated value of Z. (It is easier to find $-z_c$ and negate it.)

a. $z_{0.025}$

b. $z_{0.20}$

6. Find the indicated value of Z. (It is easier to find $-z_c$ and negate it.)

a. $z_{0.002}$

b. $z_{0.02}$

7. Find the value of x* that yields the probability shown, where X is a normally
distributed random variable with mean 83 and standard deviation 4.

a. P(X < x*) = 0.8700

b. P(X > x*) = 0.0500

8. Find the value of x* that yields the probability shown, where X is a normally
distributed random variable with mean 54 and standard deviation 12.

a.  P  (X  <  x*)  =  0.0900 

b.  P  (X  >  x*)  =  0.6500 

9. X is a normally distributed random variable with mean 15 and standard
deviation 0.25. Find the values $x_l$ and $x_r$ of X that are symmetrically located
with respect to the mean of X and satisfy $P(x_l < X < x_r) = 0.80$. (Hint: First solve
the corresponding problem for Z.)

10. X is a normally distributed random variable with mean 28 and standard
deviation 3.7. Find the values $x_l$ and $x_r$ of X that are symmetrically located
with respect to the mean of X and satisfy $P(x_l < X < x_r) = 0.65$. (Hint: First solve
the corresponding problem for Z.)


APPLICATIONS 


11.  Scores  on  a  national  exam  are  normally  distributed  with  mean  382  and 
standard  deviation  26. 

a.  Find  the  score  that  is  the  50th  percentile. 

b.  Find  the  score  that  is  the  90th  percentile. 

12.  Heights  of  women  are  normally  distributed  with  mean  63.7  inches  and 
standard  deviation  2.47  inches. 

a.  Find  the  height  that  is  the  10th  percentile. 

b.  Find  the  height  that  is  the  80th  percentile. 

13.  The  monthly  amount  of  water  used  per  household  in  a  small  community  is 
normally  distributed  with  mean  7,069  gallons  and  standard  deviation  58 
gallons.  Find  the  three  quartiles  for  the  amount  of  water  used. 

14.  The  quantity  of  gasoline  purchased  in  a  single  sale  at  a  chain  of  filling  stations 
in  a  certain  region  is  normally  distributed  with  mean  11.6  gallons  and  standard 
deviation  2.78  gallons.  Find  the  three  quartiles  for  the  quantity  of  gasoline 
purchased  in  a  single  sale. 

15.  Scores  on  the  common  final  exam  given  in  a  large  enrollment  multiple  section 
course  were  normally  distributed  with  mean  69.35  and  standard  deviation 
12.93. The department has the rule that in order to receive an A in the course
a student's score must be in the top 10% of all exam scores. Find the minimum exam
score that meets this requirement.

16.  The  average  finishing  time  among  all  high  school  boys  in  a  particular  track 
event  in  a  certain  state  is  5  minutes  17  seconds.  Times  are  normally  distributed 
with  standard  deviation  12  seconds. 

a.  The  qualifying  time  in  this  event  for  participation  in  the  state  meet  is  to  be 
set  so  that  only  the  fastest  5%  of  all  runners  qualify.  Find  the  qualifying 
time.  (Hint:  Convert  seconds  to  minutes.) 

b.  In  the  western  region  of  the  state  the  times  of  all  boys  running  in  this 
event  are  normally  distributed  with  standard  deviation  12  seconds,  but 
with  mean  5  minutes  22  seconds.  Find  the  proportion  of  boys  from  this 
region  who  qualify  to  run  in  this  event  in  the  state  meet. 

17.  Tests  of  a  new  tire  developed  by  a  tire  manufacturer  led  to  an  estimated  mean 
tread  life  of  67,350  miles  and  standard  deviation  of  1,120  miles.  The 
manufacturer  will  advertise  the  lifetime  of  the  tire  (for  example,  a  “50,000  mile 
tire”)  using  the  largest  value  for  which  it  is  expected  that  98%  of  the  tires  will 
last  at  least  that  long.  Assuming  tire  life  is  normally  distributed,  find  that 
advertised  value. 

18. Tests of a new light bulb led to an estimated mean life of 1,321 hours and standard
deviation of 106 hours. The manufacturer will advertise the lifetime of the bulb
using  the  largest  value  for  which  it  is  expected  that  90%  of  the  bulbs  will  last  at 
least  that  long.  Assuming  bulb  life  is  normally  distributed,  find  that  advertised 
value. 

19.  The  weights  X  of  eggs  produced  at  a  particular  farm  are  normally  distributed 
with  mean  1.72  ounces  and  standard  deviation  0.12  ounce.  Eggs  whose  weights 
lie  in  the  middle  75%  of  the  distribution  of  weights  of  all  eggs  are  classified  as 
“medium.”  Find  the  maximum  and  minimum  weights  of  such  eggs.  (These 
weights  are  endpoints  of  an  interval  that  is  symmetric  about  the  mean  and  in 
which  the  weights  of  75%  of  the  eggs  produced  at  this  farm  lie.) 

20.  The  lengths  X  of  hardwood  flooring  strips  are  normally  distributed  with  mean 
28.9  inches  and  standard  deviation  6.12  inches.  Strips  whose  lengths  lie  in  the 
middle  80%  of  the  distribution  of  lengths  of  all  strips  are  classified  as  “average- 
length  strips.”  Find  the  maximum  and  minimum  lengths  of  such  strips.  (These 
lengths  are  endpoints  of  an  interval  that  is  symmetric  about  the  mean  and  in 
which  the  lengths  of  80%  of  the  hardwood  strips  lie.) 

21.  All  students  in  a  large  enrollment  multiple  section  course  take  common  in- 
class  exams  and  a  common  final,  and  submit  common  homework  assignments. 
Course  grades  are  assigned  based  on  students'  final  overall  scores,  which  are 
approximately normally distributed. The department assigns a C to students
whose scores constitute the middle 2/3 of all scores. If scores this semester had
mean 72.5 and standard deviation 6.14, find the interval of scores that will be
assigned  a  C. 

22. Researchers wish to investigate the overall health of individuals with
abnormally high or low levels of glucose in the blood stream. Suppose glucose
levels are normally distributed with mean 96 and standard deviation 8.5 mg/dL,
and that “normal” is defined as the middle 90% of the population. Find the
interval  of  normal  glucose  levels,  that  is,  the  interval  centered  at  96  that 
contains  90%  of  all  glucose  levels  in  the  population. 


ADDITIONAL  EXERCISES 


23.  A  machine  for  filling  2-liter  bottles  of  soft  drink  delivers  an  amount  to  each 
bottle  that  varies  from  bottle  to  bottle  according  to  a  normal  distribution  with 
standard  deviation  0.002  liter  and  mean  whatever  amount  the  machine  is  set  to 
deliver. 

a.  If  the  machine  is  set  to  deliver  2  liters  (so  the  mean  amount  delivered  is  2 
liters)  what  proportion  of  the  bottles  will  contain  at  least  2  liters  of  soft 
drink? 

b.  Find  the  minimum  setting  of  the  mean  amount  delivered  by  the  machine 
so  that  at  least  99%  of  all  bottles  will  contain  at  least  2  liters. 

24.  A  nursery  has  observed  that  the  mean  number  of  days  it  must  darken  the 
environment of a species of poinsettia plant daily in order to have it ready for
market  is  71  days.  Suppose  the  lengths  of  such  periods  of  darkening  are 
normally  distributed  with  standard  deviation  2  days.  Find  the  number  of  days 
in  advance  of  the  projected  delivery  dates  of  the  plants  to  market  that  the 
nursery  must  begin  the  daily  darkening  process  in  order  that  at  least  95%  of 
the  plants  will  be  ready  on  time.  (Poinsettias  are  so  long-lived  that  once  ready 
for  market  the  plant  remains  salable  indefinitely.) 


ANSWERS


1.  a.  -2.43 

b.  2.17 

c.  -1.28 

d.  2.29 

3.  a.  -1.04 

b.  0.67 

c.  0.43 

d.  -0.84 

5.  a.  1.96 

b.  0.84 

7.  a.  87.52 

b.  89.58 

9.  14.68,  15.32

11.  a.  382 

b.  415 

13.  7030.14,  7069,  7107.86 

15.  85.90 

17.  65,054 

19.  1.58,  1.86 

21.  66.5,78.5 

23.  a.  0.5 

b.  2.005 


Sampling Distributions


A  statistic,  such  as  the  sample  mean  or  the  sample  standard  deviation,  is  a  number 
computed  from  a  sample.  Since  a  sample  is  random,  every  statistic  is  a  random 
variable:  it  varies  from  sample  to  sample  in  a  way  that  cannot  be  predicted  with 
certainty.  As  a  random  variable  it  has  a  mean,  a  standard  deviation,  and  a 
probability  distribution.  The  probability  distribution  of  a  statistic  is  called  its 
sampling distribution¹. Typically sample statistics are not ends in themselves, but
are  computed  in  order  to  estimate  the  corresponding  population  parameters,  as 
illustrated  in  the  grand  picture  of  statistics  presented  in  Figure  1.1  "The  Grand 
Picture  of  Statistics"  in  Chapter  1  "Introduction". 


This  chapter  introduces  the  concepts  of  the  mean,  the  standard  deviation,  and  the 
sampling distribution of a sample statistic, with an emphasis on the sample mean $\bar{x}$.


1.  The  probability  distribution  of 
a  sample  statistic  when  the 
statistic  is  viewed  as  a  random 
variable. 
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6.1  The  Mean  and  Standard  Deviation  of  the  Sample  Mean 


LEARNING  OBJECTIVES 

1.  To  become  familiar  with  the  concept  of  the  probability  distribution  of 
the  sample  mean. 

2.  To  understand  the  meaning  of  the  formulas  for  the  mean  and  standard 
deviation  of  the  sample  mean. 


Suppose we wish to estimate the mean $\mu$ of a population. In actual practice we
would typically take just one sample. Imagine however that we take sample after
sample, all of the same size $n$, and compute the sample mean $\bar{x}$ of each one. We will
likely get a different value of $\bar{x}$ each time. The sample mean $\bar{x}$ is a random variable:
it varies from sample to sample in a way that cannot be predicted with certainty.
We will write $\bar{X}$ when the sample mean is thought of as a random variable, and
write $\bar{x}$ for the values that it takes. The random variable $\bar{X}$ has a mean², denoted
$\mu_{\bar{X}}$, and a standard deviation³, denoted $\sigma_{\bar{X}}$. Here is an example with such a small
population and small sample size that we can actually write down every single
sample.


2.  The  number  about  which 
means  computed  from  samples 
of  the  same  size  center. 

3.  A  measure  of  the  variability  of 
means  computed  from  samples 
of  the  same  size. 
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EXAMPLE  1 


A  rowing  team  consists  of  four  rowers  who  weigh  152, 156, 160,  and  164 
pounds.  Find  all  possible  random  samples  with  replacement  of  size  two  and 
compute  the  sample  mean  for  each  one.  Use  them  to  find  the  probability 
distribution, the mean, and the standard deviation of the sample mean $\bar{X}$.

Solution 

The  following  table  shows  all  possible  samples  with  replacement  of  size  two, 
along  with  the  mean  of  each: 


Sample      Mean    Sample      Mean    Sample      Mean    Sample      Mean
152, 152    152     156, 152    154     160, 152    156     164, 152    158
152, 156    154     156, 156    156     160, 156    158     164, 156    160
152, 160    156     156, 160    158     160, 160    160     164, 160    162
152, 164    158     156, 164    160     160, 164    162     164, 164    164

The table shows that there are seven possible values of the sample mean $\bar{X}$.
The value $\bar{x} = 152$ happens only one way (the rower weighing 152 pounds
must be selected both times), as does the value $\bar{x} = 164$, but the other
values happen more than one way, hence are more likely to be observed
than 152 and 164 are. Since the 16 samples are equally likely, we obtain the
probability distribution of the sample mean just by counting:


$\bar{x}$       152     154     156     158     160     162     164
$P(\bar{x})$    1/16    2/16    3/16    4/16    3/16    2/16    1/16

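Now we apply the formulas from Section 4.2.2 "The Mean and Standard
Deviation of a Discrete Random Variable" in Chapter 4 "Discrete Random
Variables" for the mean and standard deviation of a discrete random
variable to $\bar{X}$. For $\mu_{\bar{X}}$ we obtain

$$\mu_{\bar{X}} = \sum \bar{x}\,P(\bar{x}) = 152\left(\tfrac{1}{16}\right) + 154\left(\tfrac{2}{16}\right) + 156\left(\tfrac{3}{16}\right) + 158\left(\tfrac{4}{16}\right) + 160\left(\tfrac{3}{16}\right) + 162\left(\tfrac{2}{16}\right) + 164\left(\tfrac{1}{16}\right) = 158$$

For $\sigma_{\bar{X}}$ we first compute $\sum \bar{x}^2 P(\bar{x})$:

$$152^2\left(\tfrac{1}{16}\right) + 154^2\left(\tfrac{2}{16}\right) + 156^2\left(\tfrac{3}{16}\right) + 158^2\left(\tfrac{4}{16}\right) + 160^2\left(\tfrac{3}{16}\right) + 162^2\left(\tfrac{2}{16}\right) + 164^2\left(\tfrac{1}{16}\right)$$

which is 24,974, so that

$$\sigma_{\bar{X}} = \sqrt{\sum \bar{x}^2 P(\bar{x}) - \mu_{\bar{X}}^2} = \sqrt{24{,}974 - 158^2} = \sqrt{10}$$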


The mean and standard deviation of the population {152, 156, 160, 164} in the
example are $\mu = 158$ and $\sigma = \sqrt{20}$. The mean of the sample mean $\bar{X}$ that we have
just computed is exactly the mean of the population. The standard deviation of the
sample mean $\bar{X}$ that we have just computed is the standard deviation of the
population divided by the square root of the sample size: $\sqrt{10} = \sqrt{20}/\sqrt{2}$. These
relationships are not coincidences, but are illustrations of the following formulas.


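Suppose random samples of size $n$ are drawn from a population with mean $\mu$
and standard deviation $\sigma$. The mean $\mu_{\bar{X}}$ and standard deviation $\sigma_{\bar{X}}$ of the
sample mean $\bar{X}$ satisfy

$$\mu_{\bar{X}} = \mu \quad\text{and}\quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$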


The  first  formula  says  that  if  we  could  take  every  possible  sample  from  the 
population  and  compute  the  corresponding  sample  mean,  then  those  numbers 
would center at the number we wish to estimate, the population mean $\mu$.
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The  second  formula  says  that  averages  computed  from  samples  vary  less  than 
individual  measurements  on  the  population  do,  and  quantifies  the  relationship. 
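The two formulas can be checked by brute force on the rowing team of Example 1. The following is a minimal sketch, not part of the original text: it lists every sample of size two drawn with replacement and compares the mean and standard deviation of the resulting sample means with $\mu$ and $\sigma/\sqrt{n}$. The helper names `mean` and `std_dev` are chosen here only for illustration.

```python
from itertools import product
from math import sqrt

population = [152, 156, 160, 164]   # rower weights from Example 1
n = 2                               # sample size

# Every ordered sample of size n drawn with replacement (4**2 = 16 of them).
samples = list(product(population, repeat=n))
sample_means = [sum(s) / n for s in samples]

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Standard deviation of a complete list of values (divide by the count)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

mu, sigma = mean(population), std_dev(population)
# The 16 samples are equally likely, so averaging over the list of sample
# means gives the mean and standard deviation of the sampling distribution.
mu_xbar, sigma_xbar = mean(sample_means), std_dev(sample_means)

print(mu, mu_xbar)                   # population mean and mean of the sample mean
print(sigma_xbar, sigma / sqrt(n))   # sd of the sample mean and sigma / sqrt(n)
```

Each printed pair should agree (up to floating-point rounding), which is exactly what the two formulas assert for this population.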


EXAMPLE  2 


The mean and standard deviation of the tax value of all vehicles registered
in a certain state are $\mu = \$13{,}525$ and $\sigma = \$4{,}180$. Suppose random
samples of size 100 are drawn from the population of vehicles. What are the
mean $\mu_{\bar{X}}$ and standard deviation $\sigma_{\bar{X}}$ of the sample mean $\bar{X}$?

Solution 

Since $n = 100$, the formulas yield

$$\mu_{\bar{X}} = \mu = \$13{,}525 \quad\text{and}\quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{\$4{,}180}{\sqrt{100}} = \$418$$

KEY  TAKEAWAYS 


• The sample mean is a random variable; as such it is written $\bar{X}$, and $\bar{x}$
stands for individual values it takes.

• As a random variable the sample mean has a probability distribution, a
mean $\mu_{\bar{X}}$, and a standard deviation $\sigma_{\bar{X}}$.

•  There  are  formulas  that  relate  the  mean  and  standard  deviation  of  the 
sample  mean  to  the  mean  and  standard  deviation  of  the  population  from 
which  the  sample  is  drawn. 
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EXERCISES 


1.  Random  samples  of  size  225  are  drawn  from  a  population  with  mean  100  and 
standard  deviation  20.  Find  the  mean  and  standard  deviation  of  the  sample 
mean. 

2.  Random  samples  of  size  64  are  drawn  from  a  population  with  mean  32  and 
standard  deviation  5.  Find  the  mean  and  standard  deviation  of  the  sample 
mean. 

3.  A  population  has  mean  75  and  standard  deviation  12. 

a.  Random  samples  of  size  121  are  taken.  Find  the  mean  and  standard 
deviation  of  the  sample  mean. 

b.  How  would  the  answers  to  part  (a)  change  if  the  size  of  the  samples  were 
400  instead  of  121? 

4.  A  population  has  mean  5.75  and  standard  deviation  1.02. 

a.  Random  samples  of  size  81  are  taken.  Find  the  mean  and  standard 
deviation  of  the  sample  mean. 

b.  How  would  the  answers  to  part  (a)  change  if  the  size  of  the  samples  were 
25  instead  of  81? 


ANSWERS 


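1. $\mu_{\bar{X}} = 100$, $\sigma_{\bar{X}} = 1.33$

3. a. $\mu_{\bar{X}} = 75$, $\sigma_{\bar{X}} = 1.09$

b. $\mu_{\bar{X}}$ stays the same but $\sigma_{\bar{X}}$ decreases to 0.6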
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6.2  The  Sampling  Distribution  of  the  Sample  Mean 


LEARNING  OBJECTIVES 

1. To learn what the sampling distribution of $\bar{X}$ is when the sample size is
large.

2. To learn what the sampling distribution of $\bar{X}$ is when the population is
normal.


The  Central  Limit  Theorem 

In Note 6.5 "Example 1" in Section 6.1 "The Mean and Standard Deviation of the
Sample  Mean"  we  constructed  the  probability  distribution  of  the  sample  mean  for 
samples  of  size  two  drawn  from  the  population  of  four  rowers.  The  probability 
distribution  is: 


$\bar{x}$       152     154     156     158     160     162     164
$P(\bar{x})$    1/16    2/16    3/16    4/16    3/16    2/16    1/16

Figure  6.1  "Distribution  of  a  Population  and  a  Sample  Mean"  shows  a  side-by-side 
comparison  of  a  histogram  for  the  original  population  and  a  histogram  for  this 
distribution.  Whereas  the  distribution  of  the  population  is  uniform,  the  sampling 
distribution  of  the  mean  has  a  shape  approaching  the  shape  of  the  familiar  bell 
curve.  This  phenomenon  of  the  sampling  distribution  of  the  mean  taking  on  a  bell 
shape  even  though  the  population  distribution  is  not  bell-shaped  happens  in 
general.  Here  is  a  somewhat  more  realistic  example. 
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Figure  6.1  Distribution  of  a  Population  and  a  Sample  Mean 


[Figure panels: (a) Population; (b) Sample Mean]


Suppose  we  take  samples  of  size  1,  5, 10,  or  20  from  a  population  that  consists 
entirely  of  the  numbers  0  and  1,  half  the  population  0,  half  1,  so  that  the  population 
mean  is  0.5.  The  sampling  distributions  are: 


n = 1:

$\bar{x}$       0      1
$P(\bar{x})$    0.5    0.5

n = 5:

$\bar{x}$       0      0.2    0.4    0.6    0.8    1
$P(\bar{x})$    0.03   0.16   0.31   0.31   0.16   0.03

n = 10:

$\bar{x}$       0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
$P(\bar{x})$    0.00   0.01   0.04   0.12   0.21   0.25   0.21   0.12   0.04   0.01   0.00

n = 20:

$\bar{x}$       0      0.05   0.10   0.15   0.20   0.25   0.30   0.35   0.40   0.45   0.50
$P(\bar{x})$    0.00   0.00   0.00   0.00   0.00   0.01   0.04   0.07   0.12   0.16   0.18

$\bar{x}$       0.55   0.60   0.65   0.70   0.75   0.80   0.85   0.90   0.95   1
$P(\bar{x})$    0.16   0.12   0.07   0.04   0.01   0.00   0.00   0.00   0.00   0.00

Histograms  illustrating  these  distributions  are  shown  in  Figure  6.2  "Distributions  of 
the  Sample  Mean". 


Figure  6.2  Distributions  of  the  Sample  Mean 


[Histograms of the sampling distributions for n = 1, 5, 10, and 20]


As $n$ increases the sampling distribution of $\bar{X}$ evolves in an interesting way: the
probabilities on the lower and the upper ends shrink and the probabilities in the
middle become larger in relation to them. If we were to continue to increase $n$ then
the shape of the sampling distribution would become smoother and more
bell-shaped.
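The sampling distributions tabulated for the population of zeros and ones can be reproduced directly. The sketch below rests on one observation that is not spelled out in the text: the number of ones in a sample of size $n$ follows a binomial distribution, and the sample mean is that count divided by $n$. The function name `sampling_distribution` is a label chosen here for illustration.

```python
from math import comb

def sampling_distribution(n, p=0.5):
    """Return {x_bar: probability} for the mean of n draws from a 0/1 population.

    The number of ones in the sample is binomial(n, p), and the sample
    mean is that count divided by n.
    """
    return {k / n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for n in (1, 5, 10, 20):
    dist = sampling_distribution(n)
    # Each entry is printed as value:probability, rounded to two decimals.
    row = "  ".join(f"{x:.2f}:{prob:.2f}" for x, prob in dist.items())
    print(f"n = {n:2d}  {row}")
```

Rounded to two decimal places, the probabilities should match the tables for n = 1, 5, 10, and 20, with the mass concentrating near 0.5 as n grows.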

What we are seeing in these examples does not depend on the particular population
distributions involved. In general, one may start with any distribution that has a
well-defined mean and standard deviation, and the sampling distribution of the
sample mean will increasingly resemble the bell-shaped normal curve as the sample
size increases. This is the content of the Central Limit Theorem.




The  Central  Limit  Theorem 

For samples of size 30 or more, the sample mean is approximately normally
distributed, with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, where $n$ is
the sample size. The larger the sample size, the better the approximation.


The  Central  Limit  Theorem  is  illustrated  for  several  common  population 
distributions  in  Figure  6.3  "Distribution  of  Populations  and  Sample  Means". 


Figure  6.3  Distribution  of  Populations  and  Sample  Means 


The  dashed  vertical  lines  in  the  figures  locate  the  population  mean.  Regardless  of 
the  distribution  of  the  population,  as  the  sample  size  is  increased  the  shape  of  the 
sampling  distribution  of  the  sample  mean  becomes  increasingly  bell-shaped, 
centered  on  the  population  mean.  Typically  by  the  time  the  sample  size  is  30  the 
distribution  of  the  sample  mean  is  practically  the  same  as  a  normal  distribution. 
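A small simulation can make the theorem concrete. The following is a sketch under stated assumptions rather than anything taken from the text: the population is taken to be exponential with rate 1 (strongly right-skewed, with mean 1 and standard deviation 1), and the seed and the number of simulated samples are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(1)            # fixed seed so the run is repeatable (arbitrary choice)

n = 30                    # sample size
num_samples = 20_000      # number of simulated samples (arbitrary choice)

# Assumed population: exponential with rate 1, which is strongly right-skewed.
# Its mean and standard deviation are both 1.
mu, sigma = 1.0, 1.0

# Draw many samples of size n and record the mean of each one.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

center = mean(sample_means)
spread = pstdev(sample_means)
# Fraction of sample means within one sigma / sqrt(n) of mu;
# for an exactly normal distribution this fraction is about 0.68.
within_one = sum(abs(m - mu) <= sigma / sqrt(n) for m in sample_means) / num_samples

print(f"mean of sample means:    {center:.3f}  (theory: {mu})")
print(f"std dev of sample means: {spread:.3f}  (theory: {sigma / sqrt(n):.3f})")
print(f"fraction within one standard deviation: {within_one:.3f}")
```

Even though the assumed population is far from bell-shaped, the simulated sample means should center near $\mu$, vary by about $\sigma/\sqrt{n}$, and fall within one standard deviation of the center roughly as often as a normal variable would.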


The importance of the Central Limit Theorem is that it allows us to make
probability statements about the sample mean, specifically in relation to its value in
comparison to the population mean, as we will see in the examples. But to use the
result properly we must first realize that there are two separate random variables
(and therefore two probability distributions) at play:

1. $X$, the measurement of a single element selected at random from the
population; the distribution of $X$ is the distribution of the population,
with mean the population mean $\mu$ and standard deviation the
population standard deviation $\sigma$;

2. $\bar{X}$, the mean of the measurements in a sample of size $n$; the distribution
of $\bar{X}$ is its sampling distribution, with mean $\mu_{\bar{X}} = \mu$ and standard
deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.




EXAMPLE  3 


Let $\bar{X}$ be the mean of a random sample of size 50 drawn from a population
with mean 112 and standard deviation 40.

a. Find the mean and standard deviation of $\bar{X}$.

b. Find the probability that $\bar{X}$ assumes a value between 110 and 114.

c. Find the probability that $\bar{X}$ assumes a value greater than 113.

Solution 


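a. By the formulas in the previous section

$$\mu_{\bar{X}} = \mu = 112 \quad\text{and}\quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{40}{\sqrt{50}} = 5.65685$$

b. Since the sample size is at least 30, the Central Limit Theorem
applies: $\bar{X}$ is approximately normally distributed. We compute
probabilities using Figure 12.2 "Cumulative Normal Probability"
in the usual way, just being careful to use $\sigma_{\bar{X}}$ and not $\sigma$ when we
standardize:

$$\begin{aligned}
P(110 < \bar{X} < 114) &= P\left(\frac{110 - \mu_{\bar{X}}}{\sigma_{\bar{X}}} < Z < \frac{114 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) \\
&= P\left(\frac{110 - 112}{5.65685} < Z < \frac{114 - 112}{5.65685}\right) \\
&= P(-0.35 < Z < 0.35) = 0.6368 - 0.3632 = 0.2736
\end{aligned}$$

c. Similarly

$$\begin{aligned}
P(\bar{X} > 113) &= P\left(Z > \frac{113 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) \\
&= P\left(Z > \frac{113 - 112}{5.65685}\right) \\
&= P(Z > 0.18) \\
&= 1 - P(Z < 0.18) = 1 - 0.5714 = 0.4286
\end{aligned}$$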


Note that if in Note 6.11 "Example 3" we had been asked to compute the probability
that the value of a single randomly selected element of the population exceeds 113,
that is, to compute the number $P(X > 113)$, we would not have been able to do so,
since we do not know the distribution of $X$, but only that its mean is 112 and its
standard deviation is 40. By contrast we could compute $P(\bar{X} > 113)$ even without
complete knowledge of the distribution of $X$ because the Central Limit Theorem
guarantees that $\bar{X}$ is approximately normal.
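The probabilities in Example 3 can also be computed without a table. The sketch below uses `NormalDist` from Python's standard `statistics` module as the normal approximation to the distribution of the sample mean; the variable names are illustrative only. Because the table method rounds $z$ to two decimal places, the software values can be expected to differ from 0.2736 and 0.4286 by a few thousandths.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 112, 40, 50          # population mean, population sd, sample size
sigma_xbar = sigma / sqrt(n)        # standard deviation of the sample mean

# Normal approximation to the distribution of the sample mean
# (justified by the Central Limit Theorem, since n >= 30).
xbar = NormalDist(mu=mu, sigma=sigma_xbar)

p_between = xbar.cdf(114) - xbar.cdf(110)   # P(110 < X-bar < 114)
p_above = 1 - xbar.cdf(113)                 # P(X-bar > 113)

print(f"sigma_xbar = {sigma_xbar:.5f}")
print(f"P(110 < X-bar < 114) = {p_between:.4f}")
print(f"P(X-bar > 113) = {p_above:.4f}")
```

Note that the code standardizes with $\sigma_{\bar{X}}$, not $\sigma$, by building the normal distribution from `sigma_xbar`.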
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EXAMPLE  4 


The  numerical  population  of  grade  point  averages  at  a  college  has  mean  2.61 
and standard deviation 0.5. If a random sample of size 100 is taken from the
population,  what  is  the  probability  that  the  sample  mean  will  be  between 
2.51  and  2.71? 


Solution 


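The sample mean $\bar{X}$ has mean $\mu_{\bar{X}} = \mu = 2.61$ and standard deviation
$\sigma_{\bar{X}} = \sigma/\sqrt{n} = 0.5/10 = 0.05$, so

$$\begin{aligned}
P(2.51 < \bar{X} < 2.71) &= P\left(\frac{2.51 - \mu_{\bar{X}}}{\sigma_{\bar{X}}} < Z < \frac{2.71 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) \\
&= P\left(\frac{2.51 - 2.61}{0.05} < Z < \frac{2.71 - 2.61}{0.05}\right) \\
&= P(-2 < Z < 2) \\
&= P(Z < 2) - P(Z < -2) \\
&= 0.9772 - 0.0228 = 0.9544
\end{aligned}$$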


Normally  Distributed  Populations 

The Central Limit Theorem says that no matter what the distribution of the
population is, as long as the sample is “large,” meaning of size 30 or more, the
sample mean is approximately normally distributed. If the population is normal to
begin with then the sample mean also has a normal distribution, regardless of the
sample size.


For samples of any size drawn from a normally distributed population, the
sample mean is normally distributed, with mean $\mu_{\bar{X}} = \mu$ and standard
deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, where $n$ is the sample size.


The  effect  of  increasing  the  sample  size  is  shown  in  Figure  6.4  "Distribution  of 
Sample  Means  for  a  Normal  Population". 
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Figure  6.4  Distribution  of  Sample  Means  for  a  Normal  Population 
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EXAMPLE  5 


A  prototype  automotive  tire  has  a  design  life  of  38,500  miles  with  a  standard 
deviation  of  2,500  miles.  Five  such  tires  are  manufactured  and  tested.  On  the 
assumption  that  the  actual  population  mean  is  38,500  miles  and  the  actual 
population  standard  deviation  is  2,500  miles,  find  the  probability  that  the 
sample  mean  will  be  less  than  36,000  miles.  Assume  that  the  distribution  of 
lifetimes  of  such  tires  is  normal. 


Solution 


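For simplicity we use units of thousands of miles. Then the sample mean $\bar{X}$
has mean $\mu_{\bar{X}} = \mu = 38.5$ and standard deviation
$\sigma_{\bar{X}} = \sigma/\sqrt{n} = 2.5/\sqrt{5} = 1.11803$. Since the population is normally
distributed, so is $\bar{X}$, hence

$$\begin{aligned}
P(\bar{X} < 36) &= P\left(Z < \frac{36 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) \\
&= P\left(Z < \frac{36 - 38.5}{1.11803}\right) \\
&= P(Z < -2.24) = 0.0125
\end{aligned}$$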


That  is,  if  the  tires  perform  as  designed,  there  is  only  about  a  1.25%  chance 
that  the  average  of  a  sample  of  this  size  would  be  so  low. 
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An  automobile  battery  manufacturer  claims  that  its  midgrade  battery  has  a 
mean  life  of  50  months  with  a  standard  deviation  of  6  months.  Suppose  the 
distribution  of  battery  lives  of  this  particular  brand  is  approximately 
normal. 

a.  On  the  assumption  that  the  manufacturer’s  claims  are  true,  find  the 
probability  that  a  randomly  selected  battery  of  this  type  will  last  less 
than  48  months. 

b.  On  the  same  assumption,  find  the  probability  that  the  mean  of  a  random 
sample  of  36  such  batteries  will  be  less  than  48  months. 


Solution 


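a. Since the population is known to have a normal distribution

$$P(X < 48) = P\left(Z < \frac{48 - \mu}{\sigma}\right) = P\left(Z < \frac{48 - 50}{6}\right) = P(Z < -0.33) = 0.3707$$

b. The sample mean has mean $\mu_{\bar{X}} = \mu = 50$ and standard
deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 6/\sqrt{36} = 1$. Thus

$$P(\bar{X} < 48) = P\left(Z < \frac{48 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) = P\left(Z < \frac{48 - 50}{1}\right) = P(Z < -2) = 0.0228$$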
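The contrast between a single battery and the mean of 36 batteries comes down to which standard deviation is used. As a minimal sketch (the variable names are illustrative and not from the text), the two probabilities can be computed side by side with `NormalDist` from the standard `statistics` module:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 50, 6      # claimed mean life and standard deviation, in months
n = 36                 # number of batteries in the sample

single_battery = NormalDist(mu, sigma)            # distribution of X
sample_mean = NormalDist(mu, sigma / sqrt(n))     # distribution of X-bar

p_single = single_battery.cdf(48)   # part (a): P(X < 48)
p_mean = sample_mean.cdf(48)        # part (b): P(X-bar < 48)

print(f"P(X < 48)     = {p_single:.4f}")
print(f"P(X-bar < 48) = {p_mean:.4f}")
```

The value for part (a) can be expected to differ slightly from the table answer because the table method rounds $z = -1/3$ to $-0.33$. The much smaller probability in part (b) reflects the fact that averages of 36 batteries vary far less than individual batteries do.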
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KEY  TAKEAWAYS 


• When the sample size is at least 30 the sample mean is approximately
normally distributed.

•  When  the  population  is  normal  the  sample  mean  is  normally  distributed 
regardless  of  the  sample  size. 
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a.  Find  the  mean  and  standard  deviation  of  X  for  samples  of  size  36. 

b.  Find  the  probability  that  the  mean  of  a  sample  of  size  36  will  be  within  10 
units  of  the  population  mean,  that  is,  between  118  and  138. 


2.  A  population  has  mean  1,542  and  standard  deviation  246. 

a. Find the mean and standard deviation of $\bar{X}$ for samples of size 100.

b.  Find  the  probability  that  the  mean  of  a  sample  of  size  100  will  be  within 
100  units  of  the  population  mean,  that  is,  between  1,442  and  1,642. 

3.  A  population  has  mean  73.5  and  standard  deviation  2.5. 

a. Find the mean and standard deviation of $\bar{X}$ for samples of size 30.

b.  Find  the  probability  that  the  mean  of  a  sample  of  size  30  will  be  less  than 
72. 

4.  A  population  has  mean  48.4  and  standard  deviation  6.3. 

a. Find the mean and standard deviation of $\bar{X}$ for samples of size 64.

b.  Find  the  probability  that  the  mean  of  a  sample  of  size  64  will  be  less  than 
46.7. 

5.  A  normally  distributed  population  has  mean  25.6  and  standard  deviation  3.3. 

a.  Find  the  probability  that  a  single  randomly  selected  element  X  of  the 
population  exceeds  30. 

b. Find the mean and standard deviation of $\bar{X}$ for samples of size 9.

c.  Find  the  probability  that  the  mean  of  a  sample  of  size  9  drawn  from  this 
population  exceeds  30. 

6.  A  normally  distributed  population  has  mean  57.7  and  standard  deviation  12.1. 

a.  Find  the  probability  that  a  single  randomly  selected  element  X  of  the 
population  is  less  than  45. 

b. Find the mean and standard deviation of $\bar{X}$ for samples of size 16.

c.  Find  the  probability  that  the  mean  of  a  sample  of  size  16  drawn  from  this 
population  is  less  than  45. 

7. A population has mean 557 and standard deviation 35.

a. Find the mean and standard deviation of $\bar{X}$ for samples of size 50.

b.  Find  the  probability  that  the  mean  of  a  sample  of  size  50  will  be  more  than 
570. 

8.  A  population  has  mean  16  and  standard  deviation  1.7. 

a. Find the mean and standard deviation of $\bar{X}$ for samples of size 80.

b.  Find  the  probability  that  the  mean  of  a  sample  of  size  80  will  be  more  than 
16.4. 

9.  A  normally  distributed  population  has  mean  1,214  and  standard  deviation  122. 

a.  Find  the  probability  that  a  single  randomly  selected  element  X  of  the 
population  is  between  1,100  and  1,300. 

b. Find the mean and standard deviation of $\bar{X}$ for samples of size 25.

c.  Find  the  probability  that  the  mean  of  a  sample  of  size  25  drawn  from  this 
population  is  between  1,100  and  1,300. 

10. A normally distributed population has mean 57,800 and standard deviation 750.

a.  Find  the  probability  that  a  single  randomly  selected  element  X  of  the 
population  is  between  57,000  and  58,000. 

b. Find the mean and standard deviation of $\bar{X}$ for samples of size 100.

c.  Find  the  probability  that  the  mean  of  a  sample  of  size  100  drawn  from  this 
population  is  between  57,000  and  58,000. 

11.  A  population  has  mean  72  and  standard  deviation  6. 

a. Find the mean and standard deviation of $\bar{X}$ for samples of size 45.

b.  Find  the  probability  that  the  mean  of  a  sample  of  size  45  will  differ  from 
the  population  mean  72  by  at  least  2  units,  that  is,  is  either  less  than  70  or 
more  than  74.  (Hint:  One  way  to  solve  the  problem  is  to  first  find  the 
probability  of  the  complementary  event.) 

12.  A  population  has  mean  12  and  standard  deviation  1.5. 

a. Find the mean and standard deviation of $\bar{X}$ for samples of size 90.

b.  Find  the  probability  that  the  mean  of  a  sample  of  size  90  will  differ  from 
the  population  mean  12  by  at  least  0.3  unit,  that  is,  is  either  less  than  11.7 
or  more  than  12.3.  (Hint:  One  way  to  solve  the  problem  is  to  first  find  the 
probability  of  the  complementary  event.) 
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APPLICATIONS 


13.  Suppose  the  mean  number  of  days  to  germination  of  a  variety  of  seed  is  22, 
with  standard  deviation  2.3  days.  Find  the  probability  that  the  mean 
germination  time  of  a  sample  of  160  seeds  will  be  within  0.5  day  of  the 
population  mean. 

14.  Suppose  the  mean  length  of  time  that  a  caller  is  placed  on  hold  when 
telephoning  a  customer  service  center  is  23.8  seconds,  with  standard  deviation 
4.6  seconds.  Find  the  probability  that  the  mean  length  of  time  on  hold  in  a 
sample  of  1,200  calls  will  be  within  0.5  second  of  the  population  mean. 

15.  Suppose  the  mean  amount  of  cholesterol  in  eggs  labeled  “large”  is  186 
milligrams,  with  standard  deviation  7  milligrams.  Find  the  probability  that  the 
mean  amount  of  cholesterol  in  a  sample  of  144  eggs  will  be  within  2  milligrams 
of  the  population  mean. 

16.  Suppose  that  in  one  region  of  the  country  the  mean  amount  of  credit  card  debt 
per  household  in  households  having  credit  card  debt  is  $15,250,  with  standard 
deviation  $7,125.  Find  the  probability  that  the  mean  amount  of  credit  card  debt 
in  a  sample  of  1,600  such  households  will  be  within  $300  of  the  population 
mean. 

17.  Suppose  speeds  of  vehicles  on  a  particular  stretch  of  roadway  are  normally 
distributed  with  mean  36.6  mph  and  standard  deviation  1.7  mph. 

a.  Find  the  probability  that  the  speed  X  of  a  randomly  selected  vehicle  is 
between  35  and  40  mph. 

b. Find the probability that the mean speed $\bar{X}$ of 20 randomly selected
vehicles  is  between  35  and  40  mph. 

18.  Many  sharks  enter  a  state  of  tonic  immobility  when  inverted.  Suppose  that  in  a 
particular  species  of  sharks  the  time  a  shark  remains  in  a  state  of  tonic 
immobility  when  inverted  is  normally  distributed  with  mean  11.2  minutes  and 
standard  deviation  1.1  minutes. 

a. If a biologist induces a state of tonic immobility in such a shark in order to
study  it,  find  the  probability  that  the  shark  will  remain  in  this  state  for 
between  10  and  13  minutes. 

b.  When  a  biologist  wishes  to  estimate  the  mean  time  that  such  sharks  stay 
immobile  by  inducing  tonic  immobility  in  each  of  a  sample  of  12  sharks, 
find the probability that the mean time of immobility in the sample will be
between  10  and  13  minutes. 

19. Suppose the mean cost across the country of a 30-day supply of a generic drug
is $46.58, with standard deviation $4.84. Find the probability that the mean of a
sample of 100 prices of 30-day supplies of this drug will be between $45 and
$50.

20.  Suppose  the  mean  length  of  time  between  submission  of  a  state  tax  return 
requesting  a  refund  and  the  issuance  of  the  refund  is  47  days,  with  standard 
deviation  6  days.  Find  the  probability  that  in  a  sample  of  50  returns  requesting 
a  refund,  the  mean  such  time  will  be  more  than  50  days. 

21.  Scores  on  a  common  final  exam  in  a  large  enrollment,  multiple-section 
freshman  course  are  normally  distributed  with  mean  72.7  and  standard 
deviation  13.1. 

a.  Find  the  probability  that  the  score  X  on  a  randomly  selected  exam  paper  is 
between  70  and  80. 

b. Find the probability that the mean score $\bar{X}$ of 38 randomly selected exam
papers  is  between  70  and  80. 

22. Suppose the mean weight of school children’s bookbags is 17.4 pounds, with
standard  deviation  2.2  pounds.  Find  the  probability  that  the  mean  weight  of  a 
sample  of  30  bookbags  will  exceed  17  pounds. 

23.  Suppose  that  in  a  certain  region  of  the  country  the  mean  duration  of  first 
marriages that end in divorce is 7.8 years, with standard deviation 1.2 years. Find
the  probability  that  in  a  sample  of  75  divorces,  the  mean  age  of  the  marriages 
is  at  most  8  years. 

24.  Borachio  eats  at  the  same  fast  food  restaurant  every  day.  Suppose  the  time  X 
between  the  moment  Borachio  enters  the  restaurant  and  the  moment  he  is 
served  his  food  is  normally  distributed  with  mean  4.2  minutes  and  standard 
deviation  1.3  minutes. 

a.  Find  the  probability  that  when  he  enters  the  restaurant  today  it  will  be  at 
least  5  minutes  until  he  is  served. 

b. Find the probability that the average time until he is served in eight randomly
selected  visits  to  the  restaurant  will  be  at  least  5  minutes. 


ADDITIONAL  EXERCISES 


25. A high-speed packing machine can be set to deliver between 11 and 13 ounces
of a liquid. For any delivery setting in this range the amount delivered is
normally distributed with mean some amount $\mu$ and with standard deviation
0.08 ounce. To calibrate the machine it is set to deliver a particular amount,
many containers are filled, and 25 containers are randomly selected and the
amount they contain is measured. Find the probability that the sample mean
will be within 0.05 ounce of the actual mean amount being delivered to all
containers.

26. A tire manufacturer states that a certain type of tire has a mean lifetime of
60,000 miles. Suppose lifetimes are normally distributed with standard
deviation $\sigma = 3{,}500$ miles.

a. Find the probability that if you buy one such tire, it will last only 57,000 or
fewer miles. If you had this experience, is it particularly strong evidence
that the tire is not as good as claimed?

b. A consumer group buys five such tires and tests them. Find the probability
that the average lifetime of the five tires will be 57,000 miles or less. If the
mean is so low, is that particularly strong evidence that the tire is not as
good as claimed?
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ANSWERS 


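1. a. $\mu_{\bar{X}} = 128$, $\sigma_{\bar{X}} = 3.67$

b. 0.9936

3. a. $\mu_{\bar{X}} = 73.5$, $\sigma_{\bar{X}} = 0.456$

b. 0.0005

5. a. 0.0918

b. $\mu_{\bar{X}} = 25.6$, $\sigma_{\bar{X}} = 1.1$

c. 0.0000

7. a. $\mu_{\bar{X}} = 557$, $\sigma_{\bar{X}} = 4.9497$

b. 0.0043

9. a. 0.5818

b. $\mu_{\bar{X}} = 1214$, $\sigma_{\bar{X}} = 24.4$

c. 0.9998

11. a. $\mu_{\bar{X}} = 72$, $\sigma_{\bar{X}} = 0.8944$

b. 0.0250

13. 0.9940

15. 0.9994

17. a. 0.8036

b. 1.0000

19. 0.9994

21. a. 0.2955

b. 0.8977

23. 0.9251

25. 0.9982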
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6.3  The  Sample  Proportion 


LEARNING  OBJECTIVES 

1. To recognize that the sample proportion $\hat{P}$ is a random variable.

2.  To  understand  the  meaning  of  the  formulas  for  the  mean  and  standard 
deviation  of  the  sample  proportion. 

3. To learn what the sampling distribution of $\hat{P}$ is when the sample size is
large. 


Often  sampling  is  done  in  order  to  estimate  the  proportion  of  a  population  that  has 
a  specific  characteristic,  such  as  the  proportion  of  all  items  coming  off  an  assembly 
line  that  are  defective  or  the  proportion  of  all  people  entering  a  retail  store  who 
make a purchase before leaving. The population proportion is denoted $p$ and the
sample proportion is denoted $\hat{p}$. Thus if in reality 43% of people entering a store
make a purchase before leaving, $p = 0.43$; if in a sample of 200 people entering the
store, 78 make a purchase, $\hat{p} = 78/200 = 0.39$.


The sample proportion is a random variable: it varies from sample to sample in a
way that cannot be predicted with certainty. Viewed as a random variable it will be
written $\hat{P}$. It has a mean $\mu_{\hat{P}}$ and a standard deviation $\sigma_{\hat{P}}$. Here are formulas for
their values.


4.  The  number  about  which 
proportions  computed  from 
samples  of  the  same  size 
center. 


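Suppose random samples of size $n$ are drawn from a population in which the
proportion with a characteristic of interest is $p$. The mean $\mu_{\hat{P}}$ and standard
deviation $\sigma_{\hat{P}}$ of the sample proportion $\hat{P}$ satisfy

$$\mu_{\hat{P}} = p \quad\text{and}\quad \sigma_{\hat{P}} = \sqrt{\frac{pq}{n}}$$

where $q = 1 - p$.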


5.  A  measure  of  the  variability  of 
proportions  computed  from 
samples  of  the  same  size. 


The Central Limit Theorem has an analogue for the sample proportion $\hat{P}$. To see
how, imagine that every element of the population that has the characteristic of
interest is labeled with a 1, and that every element that does not is labeled with a 0.
This gives a numerical population consisting entirely of zeros and ones. Clearly the
proportion of the population with the special characteristic is the proportion of the
numerical population that are ones; in symbols,

$$p = \frac{\text{number of 1s}}{N}$$

But of course the sum of all the zeros and ones is simply the number of ones, so the
mean $\mu$ of the numerical population is

$$\mu = \frac{\sum x}{N} = \frac{\text{number of 1s}}{N}$$

Thus the population proportion $p$ is the same as the mean $\mu$ of the corresponding
population of zeros and ones. In the same way the sample proportion $\hat{p}$ is the same
as the sample mean $\bar{x}$. Thus the Central Limit Theorem applies to $\hat{P}$. However, the
condition that the sample be large is a little more complicated than just being of
size at least 30.
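
The zero-and-one labeling can be tried directly. The sketch below builds a small made-up population (1,000 shoppers, 430 of whom buy; these numbers are assumptions chosen to echo the 43% store example) and confirms that the proportion of ones equals the ordinary mean, both for the population and for a random sample.

```python
import random

# Hypothetical population of 1,000 shoppers: 430 buy (coded 1), 570 do not (coded 0).
population = [1] * 430 + [0] * 570

proportion = population.count(1) / len(population)  # share of ones
mean = sum(population) / len(population)             # ordinary mean of the 0/1 values

random.seed(1)  # fixed seed so the sketch is repeatable
sample = random.sample(population, 200)
sample_proportion = sample.count(1) / len(sample)
sample_mean = sum(sample) / len(sample)

print(proportion, mean)
print(sample_proportion, sample_mean)
```

Each printed pair agrees because summing zeros and ones simply counts the ones.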


The  Sampling  Distribution  of  the  Sample  Proportion 

For large samples, the sample proportion is approximately normally
distributed, with mean $\mu_{\hat{P}} = p$ and standard deviation $\sigma_{\hat{P}} = \sqrt{pq/n}$.

A sample is large if the interval $[p - 3\sigma_{\hat{P}},\ p + 3\sigma_{\hat{P}}]$ lies wholly within the
interval $[0, 1]$.


In actual practice $p$ is not known, hence neither is $\sigma_{\hat{P}}$. In that case in order to check
that the sample is sufficiently large we substitute the known quantity $\hat{p}$ for $p$. This
means checking that the interval

$$\left[\hat{p} - 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

lies wholly within the interval $[0, 1]$. This is illustrated in the examples.


Figure 6.5 "Distribution of Sample Proportions" shows that when $p = 0.1$ a sample of
size 15 is too small but a sample of size 100 is acceptable. Figure 6.6 "Distribution of
Sample Proportions for $p = 0.5$ and $n = 15$" shows that when $p = 0.5$ a sample of size 15 is acceptable.
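
The three cases just mentioned can be checked with a short Python sketch of the three-standard-deviation rule. The helper names `three_sigma_interval` and `is_large_sample` are hypothetical, introduced only for this illustration; when $p$ is unknown, the same functions can be called with $\hat{p}$ in its place.

```python
import math


def three_sigma_interval(p: float, n: int) -> tuple[float, float]:
    """Return the interval [p - 3*sd, p + 3*sd] with sd = sqrt(p*(1-p)/n)."""
    sd = math.sqrt(p * (1.0 - p) / n)
    return p - 3.0 * sd, p + 3.0 * sd


def is_large_sample(p: float, n: int) -> bool:
    """True when the three-sigma interval lies wholly within [0, 1]."""
    low, high = three_sigma_interval(p, n)
    return low >= 0.0 and high <= 1.0


for p, n in [(0.1, 15), (0.1, 100), (0.5, 15)]:
    low, high = three_sigma_interval(p, n)
    print(f"p={p}, n={n}: [{low:.2f}, {high:.2f}] large={is_large_sample(p, n)}")
```

Only the first case fails, because its interval extends below zero.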


Figure 6.5 Distribution of Sample Proportions

(a) $p = 0.1$, $n = 15$: $p - 3\sqrt{pq/n} = -0.13$, $p + 3\sqrt{pq/n} = 0.33$

(b) $p = 0.1$, $n = 100$: $p - 3\sqrt{pq/n} = 0.01$, $p + 3\sqrt{pq/n} = 0.19$


Figure 6.6 Distribution of Sample Proportions for $p = 0.5$ and $n = 15$


EXAMPLE 7


Suppose that in a population of voters in a certain region 38% are in favor of a
particular bond issue. Nine hundred randomly selected voters are asked if
they favor the bond issue.


a.  Verify that the sample proportion $\hat{P}$ computed from samples of size 900
meets the condition that its sampling distribution be approximately
normal.

b.  Find  the  probability  that  the  sample  proportion  computed  from  a 
sample  of  size  900  will  be  within  5  percentage  points  of  the  true 
population  proportion. 

Solution 


a.  The information given is that $p = 0.38$, hence
$q = 1 - p = 0.62$. First we use the formulas to compute the
mean and standard deviation of $\hat{P}$:

$$\mu_{\hat{P}} = p = 0.38 \quad\text{and}\quad \sigma_{\hat{P}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.38)(0.62)}{900}} = 0.01618$$

Then $3\sigma_{\hat{P}} = 3(0.01618) = 0.04854 \approx 0.05$ so

$$[p - 3\sigma_{\hat{P}},\ p + 3\sigma_{\hat{P}}] = [0.38 - 0.05,\ 0.38 + 0.05] = [0.33, 0.43]$$

which lies wholly within the interval $[0, 1]$, so it is safe to
assume that $\hat{P}$ is approximately normally distributed.

b.  To be within 5 percentage points of the true population
proportion 0.38 means to be between $0.38 - 0.05 = 0.33$
and $0.38 + 0.05 = 0.43$. Thus

$$\begin{aligned}
P(0.33 < \hat{P} < 0.43) &= P\left(\frac{0.33 - \mu_{\hat{P}}}{\sigma_{\hat{P}}} < Z < \frac{0.43 - \mu_{\hat{P}}}{\sigma_{\hat{P}}}\right) \\
&= P\left(\frac{0.33 - 0.38}{0.01618} < Z < \frac{0.43 - 0.38}{0.01618}\right) \\
&= P(-3.09 < Z < 3.09) \\
&= P(3.09) - P(-3.09) \\
&= 0.9990 - 0.0010 = 0.9980
\end{aligned}$$
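
The table lookup in part (b) can be cross-checked in software. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` in place of the printed normal table; the function name `prob_within` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def prob_within(p: float, n: int, margin: float) -> float:
    """Normal approximation to P(p - margin < P-hat < p + margin)."""
    sd = math.sqrt(p * (1.0 - p) / n)
    dist = NormalDist(mu=p, sigma=sd)
    return dist.cdf(p + margin) - dist.cdf(p - margin)


# Example 7: p = 0.38, n = 900, within 5 percentage points.
prob = prob_within(0.38, 900, 0.05)
print(f"{prob:.4f}")
```

Because the code does not round the $z$-values to two decimal places, tiny differences from table-based answers are to be expected in general.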
EXAMPLE 8


An  online  retailer  claims  that  90%  of  all  orders  are  shipped  within  12  hours 
of  being  received.  A  consumer  group  placed  121  orders  of  different  sizes  and 
at  different  times  of  day;  102  orders  were  shipped  within  12  hours. 

a.  Compute  the  sample  proportion  of  items  shipped  within  12  hours. 

b.  Confirm  that  the  sample  is  large  enough  to  assume  that  the  sample 
proportion  is  normally  distributed.  Use  p  =  0.90,  corresponding  to  the 
assumption  that  the  retailer’s  claim  is  valid. 

c.  Assuming  the  retailer’s  claim  is  true,  find  the  probability  that  a  sample 
of  size  121  would  produce  a  sample  proportion  so  low  as  was  observed  in 
this  sample. 

d.  Based  on  the  answer  to  part  (c),  draw  a  conclusion  about  the  retailer’s 
claim. 

Solution 

a.  The sample proportion is the number $x$ of orders that are
shipped within 12 hours divided by the number $n$ of orders in the
sample:

$$\hat{p} = \frac{x}{n} = \frac{102}{121} = 0.84$$

b.  Since $p = 0.90$, $q = 1 - p = 0.10$, and $n = 121$,

$$\sigma_{\hat{P}} = \sqrt{\frac{(0.90)(0.10)}{121}} = 0.027$$

hence

$$[p - 3\sigma_{\hat{P}},\ p + 3\sigma_{\hat{P}}] = [0.90 - 0.08,\ 0.90 + 0.08] = [0.82, 0.98]$$

Because

$$[0.82, 0.98] \subset [0, 1],$$

it is appropriate to use the normal distribution to compute probabilities
related to the sample proportion $\hat{P}$.


c.  Using the value of $\hat{p}$ from part (a) and the computation in part
(b),

$$\begin{aligned}
P(\hat{P} \le 0.84) &= P\left(Z \le \frac{0.84 - \mu_{\hat{P}}}{\sigma_{\hat{P}}}\right) \\
&= P\left(Z \le \frac{0.84 - 0.90}{0.027}\right) \\
&= P(Z \le -2.20) = 0.0139
\end{aligned}$$

d.  The computation shows that a random sample of size 121 has only about
a 1.4% chance of producing a sample proportion as low as the one that was
observed, $\hat{p} = 0.84$, when taken from a population in which the actual
proportion is 0.90. This is so unlikely that it is reasonable to conclude
that the actual value of $p$ is less than the 90% claimed.

KEY  TAKEAWAYS 


•  The sample proportion is a random variable $\hat{P}$.

•  There are formulas for the mean $\mu_{\hat{P}}$ and standard deviation $\sigma_{\hat{P}}$ of the
sample proportion.

•  When the sample size is large the sample proportion is approximately normally
distributed.


1.  The proportion of a population with a characteristic of interest is $p = 0.37$. Find
the mean and standard deviation of the sample proportion $\hat{P}$ obtained from
random samples of size 1,600.

2.  The proportion of a population with a characteristic of interest is $p = 0.82$. Find
the mean and standard deviation of the sample proportion $\hat{P}$ obtained from
random samples of size 900.

3.  The proportion of a population with a characteristic of interest is $p = 0.76$. Find
the mean and standard deviation of the sample proportion $\hat{P}$ obtained from
random samples of size 1,200.

4.  The proportion of a population with a characteristic of interest is $p = 0.37$. Find
the mean and standard deviation of the sample proportion $\hat{P}$ obtained from
random samples of size 125.

5.  Random samples of size 225 are drawn from a population in which the
proportion with the characteristic of interest is 0.25. Decide whether or not the
sample size is large enough to assume that the sample proportion $\hat{P}$ is
normally distributed.

6.  Random samples of size 1,600 are drawn from a population in which the
proportion with the characteristic of interest is 0.05. Decide whether or not the
sample size is large enough to assume that the sample proportion $\hat{P}$ is
normally distributed.

7.  Random samples of size $n$ produced sample proportions $\hat{p}$ as shown. In each
case decide whether or not the sample size is large enough to assume that the
sample proportion $\hat{P}$ is normally distributed.

a.  $n = 50$, $\hat{p} = 0.48$

b.  $n = 50$, $\hat{p} = 0.12$

c.  $n = 100$, $\hat{p} = 0.12$

8.  Samples of size $n$ produced sample proportions $\hat{p}$ as shown. In each case decide
whether or not the sample size is large enough to assume that the sample
proportion $\hat{P}$ is normally distributed.

a.  $n = 30$, $\hat{p} = 0.72$

b.  $n = 30$, $\hat{p} = 0.84$

c.  $n = 75$, $\hat{p} = 0.84$

9.  A random sample of size 121 is taken from a population in which the
proportion with the characteristic of interest is $p = 0.47$. Find the indicated
probabilities.

a.  $P(0.45 \le \hat{P} \le 0.50)$

b.  $P(\hat{P} \ge 0.50)$

10.  A random sample of size 225 is taken from a population in which the
proportion with the characteristic of interest is $p = 0.34$. Find the indicated
probabilities.

a.  $P(0.25 \le \hat{P} \le 0.40)$

b.  $P(\hat{P} \le 0.35)$

11.  A random sample of size 900 is taken from a population in which the
proportion with the characteristic of interest is $p = 0.62$. Find the indicated
probabilities.

a.  $P(0.60 \le \hat{P} \le 0.64)$

b.  $P(0.57 \le \hat{P} \le 0.67)$

12.  A random sample of size 1,100 is taken from a population in which the
proportion with the characteristic of interest is $p = 0.28$. Find the indicated
probabilities.

a.  $P(0.27 \le \hat{P} \le 0.29)$

b.  $P(0.23 \le \hat{P} \le 0.33)$


APPLICATIONS 


13.  Suppose  that  8%  of  all  males  suffer  some  form  of  color  blindness.  Find  the 
probability  that  in  a  random  sample  of  250  men  at  least  10%  will  suffer  some 
form  of  color  blindness.  First  verify  that  the  sample  is  sufficiently  large  to  use 
the  normal  distribution. 

14.  Suppose  that  29%  of  all  residents  of  a  community  favor  annexation  by  a  nearby 
municipality. Find the probability that in a random sample of 50 residents at
least 35% will favor annexation. First verify that the sample is sufficiently large
to use the normal distribution.

15.  Suppose  that  2%  of  all  cell  phone  connections  by  a  certain  provider  are 
dropped.  Find  the  probability  that  in  a  random  sample  of  1,500  calls  at  most  40 
will  be  dropped.  First  verify  that  the  sample  is  sufficiently  large  to  use  the 
normal  distribution. 

16.  Suppose  that  in  20%  of  all  traffic  accidents  involving  an  injury,  driver 
distraction  in  some  form  (for  example,  changing  a  radio  station  or  texting)  is  a 
factor.  Find  the  probability  that  in  a  random  sample  of  275  such  accidents 
between  15%  and  25%  involve  driver  distraction  in  some  form.  First  verify  that 
the  sample  is  sufficiently  large  to  use  the  normal  distribution. 

17.  An  airline  claims  that  72%  of  all  its  flights  to  a  certain  region  arrive  on  time.  In 
a  random  sample  of  30  recent  arrivals,  19  were  on  time.  You  may  assume  that 
the  normal  distribution  applies. 

a.  Compute  the  sample  proportion. 

b.  Assuming  the  airline’s  claim  is  true,  find  the  probability  of  a  sample  of  size 
30  producing  a  sample  proportion  so  low  as  was  observed  in  this  sample. 

18.  A  humane  society  reports  that  19%  of  all  pet  dogs  were  adopted  from  an 
animal  shelter.  Assuming  the  truth  of  this  assertion,  find  the  probability  that 
in  a  random  sample  of  80  pet  dogs,  between  15%  and  20%  were  adopted  from  a 
shelter.  You  may  assume  that  the  normal  distribution  applies. 

19.  In  one  study  it  was  found  that  86%  of  all  homes  have  a  functional  smoke 
detector.  Suppose  this  proportion  is  valid  for  all  homes.  Find  the  probability 
that  in  a  random  sample  of  600  homes,  between  80%  and  90%  will  have  a 
functional  smoke  detector.  You  may  assume  that  the  normal  distribution 
applies. 

20.  A  state  insurance  commission  estimates  that  13%  of  all  motorists  in  its  state 
are  uninsured.  Suppose  this  proportion  is  valid.  Find  the  probability  that  in  a 
random  sample  of  50  motorists,  at  least  5  will  be  uninsured.  You  may  assume 
that  the  normal  distribution  applies. 

21.  An  outside  financial  auditor  has  observed  that  about  4%  of  all  documents  he 
examines  contain  an  error  of  some  sort.  Assuming  this  proportion  to  be 
accurate,  find  the  probability  that  a  random  sample  of  700  documents  will 
contain  at  least  30  with  some  sort  of  error.  You  may  assume  that  the  normal 
distribution  applies. 

22.  Suppose  7%  of  all  households  have  no  home  telephone  but  depend  completely 
on cell phones. Find the probability that in a random sample of 450
households, between 25 and 35 will have no home telephone. You may assume
that the normal distribution applies.


ADDITIONAL  EXERCISES 


23.  Some  countries  allow  individual  packages  of  prepackaged  goods  to  weigh  less 
than  what  is  stated  on  the  package,  subject  to  certain  conditions,  such  as  the 
average  of  all  packages  being  the  stated  weight  or  greater.  Suppose  that  one 
requirement  is  that  at  most  4%  of  all  packages  marked  500  grams  can  weigh 
less  than  490  grams.  Assuming  that  a  product  actually  meets  this  requirement, 
find  the  probability  that  in  a  random  sample  of  150  such  packages  the 
proportion  weighing  less  than  490  grams  is  at  least  3%.  You  may  assume  that 
the  normal  distribution  applies. 

24.  An  economist  wishes  to  investigate  whether  people  are  keeping  cars  longer 
now  than  in  the  past.  He  knows  that  five  years  ago,  38%  of  all  passenger 
vehicles  in  operation  were  at  least  ten  years  old.  He  commissions  a  study  in 
which  325  automobiles  are  randomly  sampled.  Of  them,  132  are  ten  years  old 
or  older. 

a.  Find  the  sample  proportion. 

b.  Find  the  probability  that,  when  a  sample  of  size  325  is  drawn  from  a 
population  in  which  the  true  proportion  is  0.38,  the  sample  proportion  will 
be  as  large  as  the  value  you  computed  in  part  (a).  You  may  assume  that  the 
normal  distribution  applies. 

c.  Give  an  interpretation  of  the  result  in  part  (b).  Is  there  strong  evidence 
that  people  are  keeping  their  cars  longer  than  was  the  case  five  years  ago? 

25.  A  state  public  health  department  wishes  to  investigate  the  effectiveness  of  a 
campaign  against  smoking.  Historically  22%  of  all  adults  in  the  state  regularly 
smoked  cigars  or  cigarettes.  In  a  survey  commissioned  by  the  public  health 
department,  279  of  1,500  randomly  selected  adults  stated  that  they  smoke 
regularly. 

a.  Find  the  sample  proportion. 

b.  Find  the  probability  that,  when  a  sample  of  size  1,500  is  drawn  from  a 
population  in  which  the  true  proportion  is  0.22,  the  sample  proportion  will 
be  no  larger  than  the  value  you  computed  in  part  (a).  You  may  assume  that 
the  normal  distribution  applies. 

c.  Give  an  interpretation  of  the  result  in  part  (b).  How  strong  is  the  evidence 
that  the  campaign  to  reduce  smoking  has  been  effective? 

26.  In  an  effort  to  reduce  the  population  of  unwanted  cats  and  dogs,  a  group  of 
veterinarians set up a low-cost spay/neuter clinic. At the inception of the clinic
a survey of pet owners indicated that 78% of all pet dogs and cats in the
community were spayed or neutered. After the low-cost clinic had been in
operation for three years, that figure had risen to 86%.

a.  What  information  is  missing  that  you  would  need  to  compute  the 
probability  that  a  sample  drawn  from  a  population  in  which  the 
proportion  is  78%  (corresponding  to  the  assumption  that  the  low-cost 
clinic  had  had  no  effect)  is  as  high  as  86%? 

b.  Knowing  that  the  size  of  the  original  sample  three  years  ago  was  150  and 
that  the  size  of  the  recent  sample  was  125,  compute  the  probability 
mentioned  in  part  (a).  You  may  assume  that  the  normal  distribution 
applies. 

c.  Give  an  interpretation  of  the  result  in  part  (b).  How  strong  is  the  evidence 
that  the  presence  of  the  low-cost  clinic  has  increased  the  proportion  of  pet 
dogs  and  cats  that  have  been  spayed  or  neutered? 

27.  An ordinary die is “fair” or “balanced” if each face has an equal chance of
landing on top when the die is rolled. Thus the proportion of times a three is
observed in a large number of tosses is expected to be close to 1/6 or $0.1\overline{6}$.
Suppose a die is rolled 240 times and shows three on top 36 times, for a sample
proportion of 0.15.

a.  Find the probability that a fair die would produce a proportion of 0.15 or
less. You may assume that the normal distribution applies.

b.  Give an interpretation of the result in part (a). How strong is the evidence
that the die is not fair?

c.  Suppose  the  sample  proportion  0.15  came  from  rolling  the  die  2,400  times 
instead  of  only  240  times.  Rework  part  (a)  under  these  circumstances. 

d.  Give  an  interpretation  of  the  result  in  part  (c).  How  strong  is  the  evidence 
that  the  die  is  not  fair? 
ANSWERS


1.  $\mu_{\hat{P}} = 0.37$, $\sigma_{\hat{P}} = 0.012$

3.  $\mu_{\hat{P}} = 0.76$, $\sigma_{\hat{P}} = 0.012$

5.  $p \pm 3\sqrt{pq/n} = 0.25 \pm 0.087$, yes

7.  a.  $\hat{p} \pm 3\sqrt{\hat{p}\hat{q}/n} = 0.48 \pm 0.21$, yes

b.  $\hat{p} \pm 3\sqrt{\hat{p}\hat{q}/n} = 0.12 \pm 0.14$, no

c.  $\hat{p} \pm 3\sqrt{\hat{p}\hat{q}/n} = 0.12 \pm 0.10$, yes

9.  a.  0.4154

b.  0.2546

11.  a.  0.7850

b.  0.9980

13.  $p \pm 3\sqrt{pq/n} = 0.08 \pm 0.05$ and $[0.03, 0.13] \subset [0, 1]$, 0.1210

15.  $p \pm 3\sqrt{pq/n} = 0.02 \pm 0.01$ and $[0.01, 0.03] \subset [0, 1]$, 0.9671

17.  a.  0.63

b.  0.1446

19.  0.9977

21.  0.3483

23.  0.7357

25.  a.  0.186

b.  0.0007

c.  In a population in which the true proportion is 22% the chance that a
random sample of size 1500 would produce a sample proportion of 18.6%
or less is only 7/100 of 1%. This is strong evidence that currently a smaller
proportion than 22% smoke.

27.  a.  0.2451

b.  We would expect a sample proportion of 0.15 or less in about 24.5% of all
samples of size 240, so this is practically no evidence at all that the die is
not fair.

c.  0.0139

d.  We would expect a sample proportion of 0.15 or less in only about 1.4% of
all samples of size 2400, so this is strong evidence that the die is not fair.
If we wish to estimate the mean $\mu$ of a population for which a census is impractical,
say the average height of all 18-year-old men in the country, a reasonable strategy
is to take a sample, compute its mean $\bar{x}$, and estimate the unknown number $\mu$ by
the known number $\bar{x}$. For example, if the average height of 100 randomly selected
men aged 18 is 70.6 inches, then we would say that the average height of all
18-year-old men is (at least approximately) 70.6 inches.


Estimating a population parameter by a single number like this is called point
estimation; in the case at hand the statistic $\bar{x}$ is a point estimate of the parameter
$\mu$. The terminology arises because a single number corresponds to a single point on
the number line.


A problem with a point estimate is that it gives no indication of how reliable the
estimate is. In contrast, in this chapter we learn about interval estimation. In brief,
in the case of estimating a population mean $\mu$ we use a formula to compute from
the data a number $E$, called the margin of error¹ of the estimate, and form the
interval $[\bar{x} - E,\ \bar{x} + E]$. We do this in such a way that a certain proportion, say
95%, of all the intervals constructed from sample data by means of this formula
contain the unknown parameter $\mu$. Such an interval is called a 95% confidence
interval² for $\mu$.


1.  E,  the  number  added  to  and 
subtracted  from  the  point 
estimate  to  produce  the 
interval  estimate. 

2.  An interval with endpoints
$\bar{x} \pm E$, computed from the
sample data in such a way that
a  specified  proportion  of  all 
intervals  constructed  by  this 
process  will  contain  the 
parameter  of  interest. 


Continuing with the example of the average height of 18-year-old men, suppose
that the sample of 100 men mentioned above for which $\bar{x} = 70.6$ inches also had
sample standard deviation $s = 1.7$ inches. It then turns out that $E = 0.33$ and we
would state that we are 95% confident that the average height of all 18-year-old
men is in the interval formed by $70.6 \pm 0.33$ inches, that is, the average is between
70.27 and 70.93 inches. If the sample statistics had come from a smaller sample, say
a sample of 50 men, the lower reliability would show up in the 95% confidence
interval being longer, hence less precise in its estimate. In this example the 95%
confidence interval for the same sample statistics but with $n = 50$ is $70.6 \pm 0.47$
inches, or from 70.13 to 71.07 inches.


7.1 Large Sample Estimation of a Population Mean


LEARNING  OBJECTIVES 

1.  To  become  familiar  with  the  concept  of  an  interval  estimate  of  the 
population  mean. 

2.  To  understand  how  to  apply  formulas  for  a  confidence  interval  for  a 
population  mean. 


The Central Limit Theorem says that, for large samples (samples of size $n \ge 30$),
when viewed as a random variable the sample mean $\bar{X}$ is normally distributed with
mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. The Empirical Rule says that
we must go about two standard deviations from the mean to capture 95% of the
values of $\bar{X}$ generated by sample after sample. A more precise distance based on the
normality of $\bar{X}$ is 1.960 standard deviations, which is $E = 1.960\,\sigma/\sqrt{n}$.

The key idea in the construction of the 95% confidence interval is this, as illustrated
in Figure 7.1 "When Winged Dots Capture the Population Mean": because in sample
after sample 95% of the values of $\bar{X}$ lie in the interval $[\mu - E,\ \mu + E]$, if we adjoin
to each side of the point estimate $\bar{x}$ a “wing” of length $E$, 95% of the intervals
formed by the winged dots contain $\mu$. The 95% confidence interval is thus
$\bar{x} \pm 1.960\,\sigma/\sqrt{n}$. For a different level of confidence³, say 90% or 99%, the
number 1.960 will change, but the idea is the same.


Figure 7.1 When Winged Dots Capture the Population Mean

[Figure: sample means $\bar{x}$ drawn as dots with wings reaching from $\bar{x} - E$ to $\bar{x} + E$, above a number line marked $\mu - E$, $\mu$, $\mu + E$.]


3.  The proportion of confidence
intervals which, if under
repeated random sampling
were always constructed
according to the formula of the
text, would contain the
parameter of interest.

Figure 7.2 "Computer Simulation of 40 95% Confidence Intervals for a Mean" shows
the intervals generated by a computer simulation of drawing 40 samples from a
normally distributed population and constructing the 95% confidence interval for
each one. We expect that about $(0.05)(40) = 2$ of the intervals so constructed
would fail to contain the population mean $\mu$, and in this simulation two of the
intervals, shown in red, do.
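
A simulation of this kind is easy to sketch. The code below is only an illustration, not the simulation behind the figure: the population mean 70, standard deviation 2, sample size 30, and the random seed are all assumptions chosen for the example, and $\sigma$ is treated as known.

```python
import math
import random
from statistics import NormalDist, mean

# Hypothetical population parameters, chosen only for illustration.
MU, SIGMA = 70.0, 2.0
N, TRIALS = 30, 40

random.seed(7)  # fixed seed so repeated runs give the same intervals
z = NormalDist().inv_cdf(0.975)      # about 1.960
E = z * SIGMA / math.sqrt(N)         # margin of error with sigma known

intervals = []
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    intervals.append((x_bar - E, x_bar + E))

# Count the intervals that fail to contain the population mean.
misses = sum(1 for low, high in intervals if not (low <= MU <= high))
print(f"{misses} of {TRIALS} intervals miss the population mean")
```

The number of misses varies with the seed; about 2 of 40 is the long-run expectation, not a guarantee for any single run.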


Figure 7.2 Computer Simulation of 40 95% Confidence Intervals for a Mean

[Figure: the 40 simulated intervals, with reference marks at $\mu - E$ and $\mu + E$.]


It is standard practice to identify the level of confidence in terms of the area $\alpha$ in
the two tails of the distribution of $\bar{X}$ when the middle part specified by the level of
confidence is taken out. This is shown in Figure 7.3, drawn for the general situation,
and in Figure 7.4, drawn for 95% confidence. Remember from Section 5.4.1 "Tails of
the Standard Normal Distribution" in Chapter 5 "Continuous Random Variables"
that the z-value that cuts off a right tail of area $c$ is denoted $z_c$. Thus the number
1.960 in the example is $z_{0.025}$, which is $z_{\alpha/2}$ for $\alpha = 1 - 0.95 = 0.05$.




Figure 7.3

For $100(1 - \alpha)\%$ confidence the area in each tail is $\alpha/2$.


Figure 7.4

[Figure label: $1 - \alpha = 0.95$]

The level of confidence can be any number between 0 and 100%, but the most
common values are probably 90% ($\alpha = 0.10$), 95% ($\alpha = 0.05$), and 99% ($\alpha = 0.01$).


Thus in general for a $100(1 - \alpha)\%$ confidence interval, $E = z_{\alpha/2}\,(\sigma/\sqrt{n})$, so
the formula for the confidence interval is $\bar{x} \pm z_{\alpha/2}\,(\sigma/\sqrt{n})$. While sometimes
the population standard deviation $\sigma$ is known, typically it is not. If not, for $n \ge 30$ it
is generally safe to approximate $\sigma$ by the sample standard deviation $s$.


Large Sample $100(1 - \alpha)\%$ Confidence Interval for a
Population Mean


If $\sigma$ is known: $\bar{x} \pm z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$

If $\sigma$ is unknown: $\bar{x} \pm z_{\alpha/2}\left(\dfrac{s}{\sqrt{n}}\right)$

A sample is considered large when $n \ge 30$.


As mentioned earlier, the number $E = z_{\alpha/2}\,\sigma/\sqrt{n}$ or $E = z_{\alpha/2}\,s/\sqrt{n}$ is called
the margin of error of the estimate.
EXAMPLE 1


Find the number $z_{\alpha/2}$ needed in construction of a confidence interval:

a.  when the level of confidence is 90%;

b.  when the level of confidence is 99%.

Solution:

a.  For confidence level 90%, $\alpha = 1 - 0.90 = 0.10$, so $z_{\alpha/2} = z_{0.05}$.
The procedure for finding this number was given in Section 5.4.1 "Tails
of the Standard Normal Distribution". Since the area under the standard
normal curve to the right of $z_{0.05}$ is 0.05, the area to the left of $z_{0.05}$ is
0.95. We search for the area 0.9500 in Figure 12.2 "Cumulative Normal
Probability". The closest entries in the table are 0.9495 and 0.9505,
corresponding to z-values 1.64 and 1.65. Since 0.95 is exactly halfway
between 0.9495 and 0.9505 we use the average 1.645 of the z-values for
$z_{0.05}$.

b.  For confidence level 99%, $\alpha = 1 - 0.99 = 0.01$, so $z_{\alpha/2} = z_{0.005}$.
Since the area under the standard normal curve to the right of $z_{0.005}$ is
0.005, the area to the left of $z_{0.005}$ is 0.9950. We search for the area 0.9950
in Figure 12.2 "Cumulative Normal Probability". The closest entries in
the table are 0.9949 and 0.9951, corresponding to z-values 2.57 and 2.58.
Since 0.995 is halfway between 0.9949 and 0.9951 we use the average
2.575 of the z-values for $z_{0.005}$.
EXAMPLE 2


Use Figure 12.3 "Critical Values of t" to find the number $z_{\alpha/2}$ needed in
construction of a confidence interval:

a.  when the level of confidence is 90%;

b.  when the level of confidence is 99%.

Solution:

a.  In the next section we will learn about a continuous random variable
that has a probability distribution called the Student t-distribution.
Figure 12.3 "Critical Values of t" gives the value $t_c$ that cuts off a right tail
of area $c$ for different values of $c$. The last line of that table, the one
whose heading is the symbol $\infty$ for infinity and [z], gives the
corresponding z-value $z_c$ that cuts off a right tail of the same area $c$. In
particular, $z_{0.05}$ is the number in that row and in the column with the
heading $t_{0.05}$. We read off directly that $z_{0.05} = 1.645$.

b.  In Figure 12.3 "Critical Values of t" $z_{0.005}$ is the number in the last row
and in the column headed $t_{0.005}$, namely 2.576.


Figure 12.3 "Critical Values of t" can be used to find $z_c$ only for those values of $c$ for
which there is a column with the heading $t_c$ appearing in the table; otherwise we
must use Figure 12.2 "Cumulative Normal Probability" in reverse. But when it can
be done it is both faster and more accurate to use the last line of Figure 12.3
"Critical Values of t" to find $z_c$ than it is to do so using Figure 12.2 "Cumulative
Normal Probability" in reverse.
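
For readers working with software instead of tables, the critical value $z_{\alpha/2}$ can be obtained from an inverse normal function. The sketch below assumes Python's standard-library `statistics.NormalDist`; the name `z_critical` is a hypothetical helper.

```python
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Return z_{alpha/2}, the z-value cutting off a right tail of area alpha/2."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Area alpha/2 to the right means area 1 - alpha/2 to the left.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```

To three decimal places these values agree with the last row of the critical-values table, which is why that row is preferred over averaging two entries of the cumulative table.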
EXAMPLE 3


A sample of size 49 has sample mean 35 and sample standard deviation 14.
Construct a 98% confidence interval for the population mean using this
information. Interpret its meaning.

Solution:


For confidence level 98%, $\alpha = 1 - 0.98 = 0.02$, so $z_{\alpha/2} = z_{0.01}$.
From Figure 12.3 "Critical Values of t" we read directly that $z_{0.01} = 2.326$.
Thus

$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} = 35 \pm 2.326\left(\frac{14}{\sqrt{49}}\right) = 35 \pm 4.652 \approx 35 \pm 4.7$$

We are 98% confident that the population mean $\mu$ lies in the interval
$[30.3, 39.7]$, in the sense that in repeated sampling 98% of all intervals
constructed from the sample data in this manner will contain $\mu$.
EXAMPLE 4


A random sample of 120 students from a large university yields mean GPA
2.71 with sample standard deviation 0.51. Construct a 90% confidence
interval for the mean GPA of all students at the university.

Solution:


For confidence level 90%, $\alpha = 1 - 0.90 = 0.10$, so $z_{\alpha/2} = z_{0.05}$.
From Figure 12.3 "Critical Values of t" we read directly that $z_{0.05} = 1.645$.
Since $n = 120$, $\bar{x} = 2.71$, and $s = 0.51$,

$$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} = 2.71 \pm 1.645\left(\frac{0.51}{\sqrt{120}}\right) = 2.71 \pm 0.0766$$

One may be 90% confident that the true average GPA of all students at the
university is contained in the interval

$$(2.71 - 0.08,\ 2.71 + 0.08) = (2.63, 2.79).$$
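
The interval calculations in Examples 3 and 4 follow one pattern, which can be sketched as a small Python function. This is an illustration under stated assumptions, not a prescribed method: `large_sample_ci` is a hypothetical name, and the critical value comes from `statistics.NormalDist` instead of the printed table.

```python
import math
from statistics import NormalDist


def large_sample_ci(x_bar: float, sd: float, n: int,
                    confidence: float) -> tuple[float, float, float]:
    """Return (lower, upper, margin of error) for a large-sample mean interval.

    sd is sigma when it is known, otherwise the sample standard deviation s.
    """
    if n < 30:
        raise ValueError("the large-sample formula assumes n >= 30")
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * sd / math.sqrt(n)
    return x_bar - margin, x_bar + margin, margin


# Example 3: n = 49, mean 35, s = 14, 98% confidence.
print(large_sample_ci(35, 14, 49, 0.98))
# Example 4: n = 120, mean 2.71, s = 0.51, 90% confidence.
print(large_sample_ci(2.71, 0.51, 120, 0.90))
```

Rounded as in the text, the endpoints match the intervals found in the two examples; the unrounded margins differ from the table-based ones only in later decimal places.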


KEY  TAKEAWAYS 


•  A  confidence  interval  for  a  population  mean  is  an  estimate  of  the 
population  mean  together  with  an  indication  of  reliability. 

•  There  are  different  formulas  for  a  confidence  interval  based  on  the 
sample  size  and  whether  or  not  the  population  standard  deviation  is 
known. 

•  The  confidence  intervals  are  constructed  entirely  from  the  sample  data 
(or  sample  data  and  the  population  standard  deviation,  when  it  is 
known). 
1. A random sample is drawn from a population of known standard deviation
11.3. Construct a 90% confidence interval for the population mean based on the
information given (not all of the information given need be used).

a. $n = 36$, $\bar{x} = 105.2$, $s = 11.2$

b. $n = 100$, $\bar{x} = 105.2$, $s = 11.2$

2. A random sample is drawn from a population of known standard deviation
22.1. Construct a 95% confidence interval for the population mean based on the
information given (not all of the information given need be used).

a. $n = 121$, $\bar{x} = 82.4$, $s = 21.9$

b. $n = 81$, $\bar{x} = 82.4$, $s = 21.9$

3. A random sample is drawn from a population of unknown standard deviation.
Construct a 99% confidence interval for the population mean based on the
information given.

a. $n = 49$, $\bar{x} = 17.1$, $s = 2.1$

b. $n = 169$, $\bar{x} = 17.1$, $s = 2.1$

4. A random sample is drawn from a population of unknown standard deviation.
Construct a 98% confidence interval for the population mean based on the
information given.

a. $n = 225$, $\bar{x} = 92.0$, $s = 8.4$

b. $n = 64$, $\bar{x} = 92.0$, $s = 8.4$

5. A random sample of size 144 is drawn from a population whose distribution,
mean, and standard deviation are all unknown. The summary statistics are
$\bar{x} = 58.2$ and $s = 2.6$.

a. Construct an 80% confidence interval for the population mean $\mu$.

b. Construct a 90% confidence interval for the population mean $\mu$.

c.  Comment  on  why  one  interval  is  longer  than  the  other. 

6.  A  random  sample  of  size  256  is  drawn  from  a  population  whose  distribution, 
mean,  and  standard  deviation  are  all  unknown.  The  summary  statistics  are 
$\bar{x} = 1011$ and $s = 34$.

a. Construct a 90% confidence interval for the population mean $\mu$.

b. Construct a 99% confidence interval for the population mean $\mu$.

c.  Comment  on  why  one  interval  is  longer  than  the  other. 


APPLICATIONS 


7.  A  government  agency  was  charged  by  the  legislature  with  estimating  the 
length  of  time  it  takes  citizens  to  fill  out  various  forms.  Two  hundred  randomly 
selected  adults  were  timed  as  they  filled  out  a  particular  form.  The  times 
required  had  mean  12.8  minutes  with  standard  deviation  1.7  minutes. 

Construct  a  90%  confidence  interval  for  the  mean  time  taken  for  all  adults  to 
fill  out  this  form. 

8.  Four  hundred  randomly  selected  working  adults  in  a  certain  state,  including 
those  who  worked  at  home,  were  asked  the  distance  from  their  home  to  their 
workplace.  The  average  distance  was  8.84  miles  with  standard  deviation  2.70 
miles.  Construct  a  99%  confidence  interval  for  the  mean  distance  from  home  to 
work  for  all  residents  of  this  state. 

9.  On  every  passenger  vehicle  that  it  tests  an  automotive  magazine  measures,  at 
true  speed  55  mph,  the  difference  between  the  true  speed  of  the  vehicle  and 
the  speed  indicated  by  the  speedometer.  For  36  vehicles  tested  the  mean 
difference  was  -1.2  mph  with  standard  deviation  0.2  mph.  Construct  a  90% 
confidence  interval  for  the  mean  difference  between  true  speed  and  indicated 
speed  for  all  vehicles. 

10.  A  corporation  monitors  time  spent  by  office  workers  browsing  the  web  on 
their  computers  instead  of  working.  In  a  sample  of  computer  records  of  50 
workers,  the  average  amount  of  time  spent  browsing  in  an  eight-hour  work 
day  was  27.8  minutes  with  standard  deviation  8.2  minutes.  Construct  a  99.5% 
confidence  interval  for  the  mean  time  spent  by  all  office  workers  in  browsing 
the  web  in  an  eight-hour  day. 

11.  A  sample  of  250  workers  aged  16  and  older  produced  an  average  length  of  time 
with  the  current  employer  (“job  tenure”)  of  4.4  years  with  standard  deviation 
3.8  years.  Construct  a  99.9%  confidence  interval  for  the  mean  job  tenure  of  all 
workers  aged  16  or  older. 

12.  The  amount  of  a  particular  biochemical  substance  related  to  bone  breakdown 
was  measured  in  30  healthy  women.  The  sample  mean  and  standard  deviation 
were 3.3 nanograms per milliliter (ng/mL) and 1.4 ng/mL. Construct an 80%
confidence  interval  for  the  mean  level  of  this  substance  in  all  healthy  women. 

13.  A  corporation  that  owns  apartment  complexes  wishes  to  estimate  the  average 
length  of  time  residents  remain  in  the  same  apartment  before  moving  out.  A 
sample of 150 rental contracts gave a mean length of occupancy of 3.7 years
with  standard  deviation  1.2  years.  Construct  a  95%  confidence  interval  for  the 
mean  length  of  occupancy  of  apartments  owned  by  this  corporation. 

14.  The  designer  of  a  garbage  truck  that  lifts  roll-out  containers  must  estimate  the 
mean  weight  the  truck  will  lift  at  each  collection  point.  A  random  sample  of 
325 containers of garbage on current collection routes yielded $\bar{x} = 75.3$ lb,
$s = 12.8$ lb. Construct a 99.8% confidence interval for the mean weight the trucks
must  lift  each  time. 

15.  In  order  to  estimate  the  mean  amount  of  damage  sustained  by  vehicles  when  a 
deer  is  struck,  an  insurance  company  examined  the  records  of  50  such 
occurrences,  and  obtained  a  sample  mean  of  $2,785  with  sample  standard 
deviation  $221.  Construct  a  95%  confidence  interval  for  the  mean  amount  of 
damage  in  all  such  accidents. 

16.  In  order  to  estimate  the  mean  FICO  credit  score  of  its  members,  a  credit  union 
samples  the  scores  of  95  members,  and  obtains  a  sample  mean  of  738.2  with 
sample  standard  deviation  64.2.  Construct  a  99%  confidence  interval  for  the 
mean  FICO  score  of  all  of  its  members. 


ADDITIONAL  EXERCISES 


17.  For  all  settings  a  packing  machine  delivers  a  precise  amount  of  liquid;  the 
amount  dispensed  always  has  standard  deviation  0.07  ounce.  To  calibrate  the 
machine  its  setting  is  fixed  and  it  is  operated  50  times.  The  mean  amount 
delivered  is  6.02  ounces  with  sample  standard  deviation  0.04  ounce.  Construct 
a  99.5%  confidence  interval  for  the  mean  amount  delivered  at  this  setting. 

Hint:  Not  all  the  information  provided  is  needed. 

18.  A  power  wrench  used  on  an  assembly  line  applies  a  precise,  preset  amount  of 
torque;  the  torque  applied  has  standard  deviation  0.73  foot-pound  at  every 
torque  setting.  To  check  that  the  wrench  is  operating  within  specifications  it  is 
used  to  tighten  100  fasteners.  The  mean  torque  applied  is  36.95  foot-pounds 
with  sample  standard  deviation  0.62  foot-pound.  Construct  a  99.9%  confidence 
interval  for  the  mean  amount  of  torque  applied  by  the  wrench  at  this  setting. 
Hint:  Not  all  the  information  provided  is  needed. 

19.  The  number  of  trips  to  a  grocery  store  per  week  was  recorded  for  a  randomly 
selected  collection  of  households,  with  the  results  shown  in  the  table. 

2 2 2 1 4 2 3 2 5 4

2 3 5 0 3 2 3 1 4 3

3 2 1 6 2 3 3 2 4 4

Construct a 95% confidence interval for the average number of trips to a
grocery  store  per  week  of  all  households. 

20.  For  each  of  40  high  school  students  in  one  county  the  number  of  days  absent 
from  school  in  the  previous  year  were  counted,  with  the  results  shown  in  the 
frequency  table. 


x    0    1    2    3    4    5
f   24    7    5    2    1    1

Construct  a  90%  confidence  interval  for  the  average  number  of  days  absent 
from  school  of  all  students  in  the  county. 

21.  A  town  council  commissioned  a  random  sample  of  85  households  to  estimate 
the  number  of  four-wheel  vehicles  per  household  in  the  town.  The  results  are 
shown  in  the  following  frequency  table. 


x    0    1    2    3    4    5
f    1   16   28   22   12    6

Construct  a  98%  confidence  interval  for  the  average  number  of  four-wheel 
vehicles  per  household  in  the  town. 

22.  The  number  of  hours  per  day  that  a  television  set  was  operating  was  recorded 
for  a  randomly  selected  collection  of  households,  with  the  results  shown  in  the 
table. 


3.7  4.2  1.5  3.6  5.9  4.7  8.2  3.9  2.5  4.4
2.1  3.6  1.1  7.3  4.2  3.0  3.8  2.2  4.2  3.8
4.3  2.1  2.4  6.0  3.7  2.5  1.3  2.8  3.0  5.6

Construct  a  99.8%  confidence  interval  for  the  mean  number  of  hours  that  a 
television  set  is  in  operation  in  all  households. 


LARGE  DATA  SET  EXERCISES 


23. Large Data Set 1 records the SAT scores of 1,000 students. Regarding it as a
random sample of all high school students, use it to construct a 99% confidence
interval for the mean SAT score of all students.

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls

24.  Large  Data  Set  1  records  the  GPAs  of  1,000  college  students.  Regarding  it  as  a 
random  sample  of  all  college  students,  use  it  to  construct  a  95%  confidence 
interval  for  the  mean  GPA  of  all  students. 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

25.  Large  Data  Set  1  lists  the  SAT  scores  of  1,000  students. 
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

a. Regard the data as arising from a census of all students at a high school, in
which the SAT score of every student was measured. Compute the
population mean $\mu$.

b. Regard the first 36 students as a random sample and use it to construct a
99% confidence interval for the mean $\mu$ of all 1,000 SAT scores. Does it
actually capture the mean $\mu$?

26.  Large  Data  Set  1  lists  the  GPAs  of  1,000  students. 
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

a. Regard the data as arising from a census of all freshmen at a small college
at the end of their first academic year of college study, in which the GPA of
every such person was measured. Compute the population mean $\mu$.

b. Regard the first 36 students as a random sample and use it to construct a
95% confidence interval for the mean $\mu$ of all 1,000 GPAs. Does it actually
capture the mean $\mu$?
ANSWERS


1. a. 105.2 ± 3.10
b. 105.2 ± 1.86

3. a. 17.1 ± 0.77
b. 17.1 ± 0.42

5. a. 58.2 ± 0.28

b. 58.2 ± 0.36

c. Asking for greater confidence requires a longer interval.

7. 12.8 ± 0.20

9. -1.2 ± 0.05
11. 4.4 ± 0.79
13. 3.7 ± 0.19
15. 2785 ± 61
17. 6.02 ± 0.03
19. 2.8 ± 0.48
21. 2.54 ± 0.30
23. (1511.43, 1546.05)

25. a. $\mu = 1528.74$

b. (1428.22, 1602.89)
7.2 Small Sample Estimation of a Population Mean


4. A distribution of a continuous
random variable that
resembles the standard
normal distribution but has
heavier tails.

LEARNING  OBJECTIVES 

1.  To  become  familiar  with  Student’s  t-distribution. 

2.  To  understand  how  to  apply  additional  formulas  for  a  confidence 
interval  for  a  population  mean. 

The confidence interval formulas in the previous section are based on the Central
Limit Theorem, the statement that for large samples $\bar{X}$ is approximately
normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
When the population mean $\mu$ is estimated with a small sample ($n < 30$), the
Central Limit Theorem does not apply. In order to proceed we assume that the
numerical population from which the sample is taken has a normal distribution to
begin with. If this condition is satisfied then, when the population standard
deviation $\sigma$ is known, the old formula $\bar{x} \pm z_{\alpha/2}\left(\sigma/\sqrt{n}\right)$
can still be used to construct a $100(1-\alpha)\%$ confidence interval for $\mu$.

If the population standard deviation is unknown and the sample size $n$ is small then
when we substitute the sample standard deviation $s$ for $\sigma$ the normal
approximation is no longer valid. The solution is to use a different distribution,
called Student’s t-distribution⁴ with $n - 1$ degrees of freedom⁵. Student’s
t-distribution is very much like the standard normal distribution in that it is centered
at 0 and has the same qualitative bell shape, but it has heavier tails than the
standard normal distribution does, as indicated by Figure 7.5 "Student’s t-Distribution",
in which the curve (in brown) that meets the dashed vertical line at the lowest point
is the t-distribution with two degrees of freedom, the next curve (in blue) is the
t-distribution with five degrees of freedom, and the thin curve (in red) is the standard
normal distribution. As also indicated by the figure, as the sample size $n$ increases,
Student’s t-distribution ever more closely resembles the standard normal
distribution. Although there is a different t-distribution for every value of $n$, once
the sample size is 30 or more it is typically acceptable to use the standard normal
distribution instead, as we will always do in this text.
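
The convergence described above can be seen numerically. The sketch below is not taken from the text: the helper names `t_density`, `t_right_tail`, and `t_critical` are hypothetical, and the critical values are found by integrating the t density with Simpson's rule and then bisecting, using only the Python standard library. In practice one would read the values from a table such as Figure 12.3 or use a statistics library.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_density(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_right_tail(t, df, steps=1000):
    """Area to the right of t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    # The density is symmetric about 0, so the area right of 0 is 0.5.
    return 0.5 - total * h / 3


def t_critical(c, df):
    """Value that cuts off a right tail of area c (0 < c < 0.5)."""
    lo, hi = 0.0, 1.0
    while t_right_tail(hi, df) > c:  # widen the bracket until the tail is small enough
        hi *= 2
    for _ in range(50):              # bisection on the bracket [lo, hi]
        mid = (lo + hi) / 2
        if t_right_tail(mid, df) > c:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(1 - 0.025)
for df in (2, 5, 14, 29, 100):
    print(f"df = {df:3d}: t = {t_critical(0.025, df):.3f}   (z = {z:.3f})")
```

The printed critical values shrink toward the standard normal value as the degrees of freedom grow, which is the numerical counterpart of the curves in the figure drawing closer to the normal curve.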

5.  A  number  that  specifies  a 
particular  t-distribution  and 
that  is  computed  based  on  the 
sample  size. 
Figure 7.5 Student’s t-Distribution

[Figure: density curves for the standard normal distribution, the t-distribution with df = 5, and the t-distribution with df = 2]


Just as the symbol $z_c$ stands for the value that cuts off a right tail of area $c$ in the
standard normal distribution, so the symbol $t_c$ stands for the value that cuts off a
right tail of area $c$ in Student’s t-distribution. This gives us the following
confidence interval formulas.
Small Sample $100(1-\alpha)\%$ Confidence Interval for a
Population Mean


If $\sigma$ is known: $\bar{x} \pm z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$


If $\sigma$ is unknown: $\bar{x} \pm t_{\alpha/2}\left(\dfrac{s}{\sqrt{n}}\right)$
(degrees of freedom $df = n - 1$)


The population must be normally distributed.


A sample is considered small when $n < 30$.


To use the new formula we use the line in Figure 12.3 "Critical Values of" that
corresponds to the relevant number of degrees of freedom, $df = n - 1$, which is
determined by the sample size.
EXAMPLE 5


A  sample  of  size  15  drawn  from  a  normally  distributed  population  has 
sample  mean  35  and  sample  standard  deviation  14.  Construct  a  95% 
confidence  interval  for  the  population  mean,  and  interpret  its  meaning. 

Solution: 


Since the population is normally distributed, the sample is small, and the
population standard deviation is unknown, the formula that applies is

$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

Confidence level 95% means that $\alpha = 1 - 0.95 = 0.05$ so
$\alpha/2 = 0.025$. Since the sample size is $n = 15$, there are $n - 1 = 14$
degrees of freedom. By Figure 12.3 "Critical Values of" $t_{0.025} = 2.145$.
Thus

$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) = 35 \pm 2.145\left(\frac{14}{\sqrt{15}}\right) = 35 \pm 7.8$$

One may be 95% confident that the true value of $\mu$ is contained in the
interval $(35 - 7.8,\ 35 + 7.8) = (27.2, 42.8)$.
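
As a check on the computation, here is a minimal sketch in which the critical value is supplied by the reader from the table, just as in the solution above. The function name `t_interval` is hypothetical.

```python
from math import sqrt


def t_interval(mean, s, n, t_crit):
    """Small-sample interval: mean +/- t_{alpha/2} * s / sqrt(n).

    `t_crit` is the table value for df = n - 1 degrees of freedom.
    The population is assumed to be normally distributed.
    """
    margin = t_crit * s / sqrt(n)
    return mean - margin, mean + margin


# Example 5: n = 15, sample mean 35, s = 14, df = 14, t_{0.025} = 2.145.
low, high = t_interval(35, 14, 15, 2.145)
print(f"({low:.1f}, {high:.1f})")  # prints (27.2, 42.8)
```

Taking the critical value as an argument mirrors the table lookup in the text and avoids depending on any particular statistics library.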
EXAMPLE 6


A  random  sample  of  12  students  from  a  large  university  yields  mean  GPA 
2.71  with  sample  standard  deviation  0.51.  Construct  a  90%  confidence 
interval  for  the  mean  GPA  of  all  students  at  the  university.  Assume  that  the 
numerical  population  of  GPAs  from  which  the  sample  is  taken  has  a  normal 
distribution. 

Solution: 


Since the population is normally distributed, the sample is small, and the
population standard deviation is unknown, the formula that applies is

$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

Confidence level 90% means that $\alpha = 1 - 0.90 = 0.10$ so
$\alpha/2 = 0.05$. Since the sample size is $n = 12$, there are $n - 1 = 11$
degrees of freedom. By Figure 12.3 "Critical Values of " $t_{0.05} = 1.796$.
Thus

$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right) = 2.71 \pm 1.796\left(\frac{0.51}{\sqrt{12}}\right) = 2.71 \pm 0.26$$

One may be 90% confident that the true average GPA of all students at the
university is contained in the interval

$$(2.71 - 0.26,\ 2.71 + 0.26) = (2.45, 2.97)$$


Compare Note 7.9 "Example 4" in Section 7.1 "Large Sample Estimation of a
Population Mean" and Note 7.16 "Example 6". The summary statistics in the two
samples are the same, but the 90% confidence interval for the average GPA of all
students at the university in Note 7.9 "Example 4" in Section 7.1 "Large Sample
Estimation of a Population Mean", (2.63, 2.79), is shorter than the 90% confidence
interval (2.45, 2.97) in Note 7.16 "Example 6". This is partly because in Note 7.9
"Example 4" the sample size is larger; there is more information pertaining to the
true value of $\mu$ in the large data set than in the small one.
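
The comparison can be made quantitative. The sketch below uses only numbers that appear in the two examples and splits the widening of the interval into two factors: one from the smaller sample size and one from using a t critical value in place of a z critical value. The variable names are illustrative.

```python
from math import sqrt

s = 0.51                        # sample standard deviation in both examples
z_crit, n_large = 1.645, 120    # Example 4: z_{0.05}, n = 120
t_crit, n_small = 1.796, 12     # Example 6: t_{0.05} with df = 11, n = 12

margin_large = z_crit * s / sqrt(n_large)
margin_small = t_crit * s / sqrt(n_small)

ratio = margin_small / margin_large          # how many times wider the small-sample margin is
from_sample_size = sqrt(n_large / n_small)   # factor due to the change in sqrt(n)
from_critical_value = t_crit / z_crit        # factor due to t replacing z

print(f"margins: {margin_large:.3f} and {margin_small:.3f}")
print(f"ratio {ratio:.2f} = {from_sample_size:.2f} x {from_critical_value:.2f}")
```

Most of the widening comes from the sample size; the remainder comes from the larger critical value of the t-distribution, which accounts for the extra uncertainty in estimating $\sigma$ by $s$ from few observations.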
KEY TAKEAWAYS


• In selecting the correct formula for construction of a confidence interval
for a population mean ask two questions: is the population standard
deviation $\sigma$ known or unknown, and is the sample large or small?

•  We  can  construct  confidence  intervals  with  small  samples  only  if  the 
population  is  normal. 
1. A random sample is drawn from a normally distributed population of known
standard deviation 5. Construct a 99.8% confidence interval for the population
mean based on the information given (not all of the information given need be
used).

a. $n = 16$, $\bar{x} = 98$, $s = 5.6$

b. $n = 9$, $\bar{x} = 98$, $s = 5.6$

2. A random sample is drawn from a normally distributed population of known
standard deviation 10.7. Construct a 95% confidence interval for the
population mean based on the information given (not all of the information
given need be used).

a. $n = 25$, $\bar{x} = 103.3$, $s = 11.0$

b. $n = 4$, $\bar{x} = 103.3$, $s = 11.0$

3. A random sample is drawn from a normally distributed population of unknown
standard deviation. Construct a 99% confidence interval for the population
mean based on the information given.

a. $n = 18$, $\bar{x} = 386$, $s = 24$

b. $n = 7$, $\bar{x} = 386$, $s = 24$

4. A random sample is drawn from a normally distributed population of unknown
standard deviation. Construct a 98% confidence interval for the population
mean based on the information given.

a. $n = 8$, $\bar{x} = 58.3$, $s = 4.1$

b. $n = 27$, $\bar{x} = 58.3$, $s = 4.1$

5. A random sample of size 14 is drawn from a normal population. The summary
statistics are $\bar{x} = 933$ and $s = 18$.

a. Construct an 80% confidence interval for the population mean $\mu$.

b. Construct a 90% confidence interval for the population mean $\mu$.

c. Comment on why one interval is longer than the other.

6. A random sample of size 28 is drawn from a normal population. The summary
statistics are $\bar{x} = 68.6$ and $s = 1.28$.

a. Construct a 95% confidence interval for the population mean $\mu$.


b. Construct a 99.5% confidence interval for the population mean $\mu$.

c.  Comment  on  why  one  interval  is  longer  than  the  other. 


APPLICATIONS 


7.  City  planners  wish  to  estimate  the  mean  lifetime  of  the  most  commonly 
planted  trees  in  urban  settings.  A  sample  of  16  recently  felled  trees  yielded 
mean  age  32.7  years  with  standard  deviation  3.1  years.  Assuming  the  lifetimes 
of  all  such  trees  are  normally  distributed,  construct  a  99.8%  confidence 
interval  for  the  mean  lifetime  of  all  such  trees. 

8.  To  estimate  the  number  of  calories  in  a  cup  of  diced  chicken  breast  meat,  the 
number  of  calories  in  a  sample  of  four  separate  cups  of  meat  is  measured.  The 
sample  mean  is  211.8  calories  with  sample  standard  deviation  0.9  calorie. 
Assuming  the  caloric  content  of  all  such  chicken  meat  is  normally  distributed, 
construct  a  95%  confidence  interval  for  the  mean  number  of  calories  in  one 
cup  of  meat. 

9.  A  college  athletic  program  wishes  to  estimate  the  average  increase  in  the  total 
weight  an  athlete  can  lift  in  three  different  lifts  after  following  a  particular 
training  program  for  six  weeks.  Twenty-five  randomly  selected  athletes  when 
placed  on  the  program  exhibited  a  mean  gain  of  47.3  lb  with  standard 
deviation  6.4  lb.  Construct  a  90%  confidence  interval  for  the  mean  increase  in 
lifting  capacity  all  athletes  would  experience  if  placed  on  the  training 
program.  Assume  increases  among  all  athletes  are  normally  distributed. 

10.  To  test  a  new  tread  design  with  respect  to  stopping  distance,  a  tire 
manufacturer  manufactures  a  set  of  prototype  tires  and  measures  the  stopping 
distance  from  70  mph  on  a  standard  test  car.  A  sample  of  25  stopping  distances 
yielded  a  sample  mean  173  feet  with  sample  standard  deviation  8  feet. 
Construct  a  98%  confidence  interval  for  the  mean  stopping  distance  for  these 
tires.  Assume  a  normal  distribution  of  stopping  distances. 

11.  A  manufacturer  of  chokes  for  shotguns  tests  a  choke  by  shooting  15  patterns  at 
targets  40  yards  away  with  a  specified  load  of  shot.  The  mean  number  of  shot 
in  a  30-inch  circle  is  53.5  with  standard  deviation  1.6.  Construct  an  80% 
confidence  interval  for  the  mean  number  of  shot  in  a  30-inch  circle  at  40  yards 
for  this  choke  with  the  specified  load.  Assume  a  normal  distribution  of  the 
number  of  shot  in  a  30-inch  circle  at  40  yards  for  this  choke. 

12.  In  order  to  estimate  the  speaking  vocabulary  of  three-year-old  children  in  a 
particular  socioeconomic  class,  a  sociologist  studies  the  speech  of  four 
children. The mean and standard deviation of the sample are $\bar{x} = 1120$ and
$s = 215$ words. Assuming that speaking vocabularies are normally distributed,
construct an 80% confidence interval for the mean speaking vocabulary of all
three-year-old  children  in  this  socioeconomic  group. 

13.  A  thread  manufacturer  tests  a  sample  of  eight  lengths  of  a  certain  type  of 
thread  made  of  blended  materials  and  obtains  a  mean  tensile  strength  of  8.2  lb 
with  standard  deviation  0.06  lb.  Assuming  tensile  strengths  are  normally 
distributed,  construct  a  90%  confidence  interval  for  the  mean  tensile  strength 
of  this  thread. 

14.  An  airline  wishes  to  estimate  the  weight  of  the  paint  on  a  fully  painted  aircraft 
of  the  type  it  flies.  In  a  sample  of  four  repaintings  the  average  weight  of  the 
paint  applied  was  239  pounds,  with  sample  standard  deviation  8  pounds. 
Assuming  that  weights  of  paint  on  aircraft  are  normally  distributed,  construct 
a  99.8%  confidence  interval  for  the  mean  weight  of  paint  on  all  such  aircraft. 

15.  In  a  study  of  dummy  foal  syndrome,  the  average  time  between  birth  and  onset 
of  noticeable  symptoms  in  a  sample  of  six  foals  was  18.6  hours,  with  standard 
deviation  1.7  hours.  Assuming  that  the  time  to  onset  of  symptoms  in  all  foals  is 
normally  distributed,  construct  a  90%  confidence  interval  for  the  mean  time 
between  birth  and  onset  of  noticeable  symptoms. 

16.  A  sample  of  26  women’s  size  6  dresses  had  mean  waist  measurement  25.25 
inches  with  sample  standard  deviation  0.375  inch.  Construct  a  95%  confidence 
interval  for  the  mean  waist  measurement  of  all  size  6  women’s  dresses.  Assume 
waist  measurements  are  normally  distributed. 


ADDITIONAL  EXERCISES 


17.  Botanists  studying  attrition  among  saplings  in  new  growth  areas  of  forests 
diligently  counted  stems  in  six  plots  in  five-year-old  new  growth  areas, 
obtaining  the  following  counts  of  stems  per  acre: 

9,432  11,026  10,539 
8,773  9,868  10,247 

Construct  an  80%  confidence  interval  for  the  mean  number  of  stems  per  acre 
in  all  five-year-old  new  growth  areas  of  forests.  Assume  that  the  number  of 
stems  per  acre  is  normally  distributed. 

18.  Nutritionists  are  investigating  the  efficacy  of  a  diet  plan  designed  to  increase 
the  caloric  intake  of  elderly  people.  The  increase  in  daily  caloric  intake  in  12 
individuals  who  are  put  on  the  plan  is  (a  minus  sign  signifies  that  calories 
consumed  went  down): 
121  284  -94  295  183  312

188  -102  259  226  152  167 

Construct  a  99.8%  confidence  interval  for  the  mean  increase  in  caloric  intake 
for  all  people  who  are  put  on  this  diet.  Assume  that  population  of  differences 
in  intake  is  normally  distributed. 

19.  A  machine  for  making  precision  cuts  in  dimension  lumber  produces  studs  with 
lengths  that  vary  with  standard  deviation  0.003  inch.  Five  trial  cuts  are  made 
to  check  the  machine’s  calibration.  The  mean  length  of  the  studs  produced  is 
104.998  inches  with  sample  standard  deviation  0.004  inch.  Construct  a  99.5% 
confidence  interval  for  the  mean  lengths  of  all  studs  cut  by  this  machine. 
Assume  lengths  are  normally  distributed.  Hint:  Not  all  the  numbers  given  in 
the  problem  are  used. 

20.  The  variation  in  time  for  a  baked  good  to  go  through  a  conveyor  oven  at  a 
large  scale  bakery  has  standard  deviation  0.017  minute  at  every  time  setting. 
To  check  the  bake  time  of  the  oven  periodically  four  batches  of  goods  are 
carefully  timed.  The  recent  check  gave  a  mean  of  27.2  minutes  with  sample 
standard  deviation  0.012  minute.  Construct  a  99.8%  confidence  interval  for  the 
mean  bake  time  of  all  batches  baked  in  this  oven.  Assume  bake  times  are 
normally  distributed.  Hint:  Not  all  the  numbers  given  in  the  problem  are  used. 

21.  Wildlife  researchers  tranquilized  and  weighed  three  adult  male  polar  bears. 
The  data  (in  pounds)  are:  926,  742, 1,109.  Assume  the  weights  of  all  bears  are 
normally  distributed. 

a.  Construct  an  80%  confidence  interval  for  the  mean  weight  of  all  adult  male 
polar  bears  using  these  data. 

b.  Convert  the  three  weights  in  pounds  to  weights  in  kilograms  using  the 
conversion 1 lb = 0.453 kg (so the first datum changes to
(926)(0.453) = 419). Use the converted data to construct an 80%
confidence  interval  for  the  mean  weight  of  all  adult  male  polar  bears 
expressed  in  kilograms. 

c.  Convert  your  answer  in  part  (a)  into  kilograms  directly  and  compare  it  to 
your  answer  in  (b).  This  illustrates  that  if  you  construct  a  confidence 
interval  in  one  system  of  units  you  can  convert  it  directly  into  another 
system  of  units  without  having  to  convert  all  the  data  to  the  new  units. 

22.  Wildlife  researchers  trapped  and  measured  six  adult  male  collared  lemmings. 
The  data  (in  millimeters)  are:  104,  99, 112, 115,  96, 109.  Assume  the  lengths  of 
all  lemmings  are  normally  distributed. 

a.  Construct  a  90%  confidence  interval  for  the  mean  length  of  all  adult  male 
collared  lemmings  using  these  data. 
b. Convert the six lengths in millimeters to lengths in inches using the
conversion 1 mm = 0.039 in (so the first datum changes to (104)(0.039) =
4.06). Use the converted data to construct a 90% confidence interval for the
mean  length  of  all  adult  male  collared  lemmings  expressed  in  inches. 

c.  Convert  your  answer  in  part  (a)  into  inches  directly  and  compare  it  to  your 
answer  in  (b).  This  illustrates  that  if  you  construct  a  confidence  interval  in 
one  system  of  units  you  can  convert  it  directly  into  another  system  of  units 
without  having  to  convert  all  the  data  to  the  new  units. 


ANSWERS 


1. a. 98 ± 3.9

b. 98 ± 5.2

3. a. 386 ± 16.4

b. 386 ± 33.6

5. a. 933 ± 6.5

b. 933 ± 8.5

c. Asking for greater confidence requires a longer interval.

7. 32.7 ± 2.9
9. 47.3 ± 2.19
11. 53.5 ± 0.56
13. 8.2 ± 0.04
15. 18.6 ± 1.4
17. 9981 ± 486
19. 104.998 ± 0.004

21. a. 926 ± 200

b. 419 ± 90

c. 419 ± 91
7.3 Large Sample Estimation of a Population Proportion


LEARNING  OBJECTIVE 

1.  To  understand  how  to  apply  the  formula  for  a  confidence  interval  for  a 
population  proportion. 


Since from Section 6.3 "The Sample Proportion" in Chapter 6 "Sampling
Distributions" we know the mean, standard deviation, and sampling distribution of
the sample proportion $\hat{p}$, the ideas of the previous two sections can be applied to
produce a confidence interval for a population proportion. Here is the formula.


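Large Sample $100(1-\alpha)\%$ Confidence Interval for a
Population Proportion


$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$


A sample is large if the interval $[p - 3\sigma_{\hat{P}},\ p + 3\sigma_{\hat{P}}]$ lies wholly within the
interval $[0, 1]$.


In actual practice the value of $p$ is not known, hence neither is $\sigma_{\hat{P}}$. In that case we
substitute the known quantity $\hat{p}$ for $p$ in making the check; this means checking
that the interval

$$\left[\hat{p} - 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

lies wholly within the interval $[0, 1]$.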
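
The check just described is mechanical enough to sketch in a few lines. The function name `is_large_sample` and the trial values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def is_large_sample(p_hat, n):
    """Return True if [p_hat - 3*se, p_hat + 3*se] lies wholly within [0, 1]."""
    # Estimated standard deviation of the sample proportion,
    # with p_hat substituted for the unknown p.
    se = sqrt(p_hat * (1 - p_hat) / n)
    return 0 <= p_hat - 3 * se and p_hat + 3 * se <= 1


print(is_large_sample(0.1, 100))  # True: the interval is about [0.01, 0.19]
print(is_large_sample(0.1, 50))   # False: the lower endpoint is negative
```

The two calls show that the same sample proportion can pass or fail the check depending on the sample size, and that proportions near 0 or 1 need larger samples.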
EXAMPLE 7


To  estimate  the  proportion  of  students  at  a  large  college  who  are  female,  a 
random  sample  of  120  students  is  selected.  There  are  69  female  students  in 
the  sample.  Construct  a  90%  confidence  interval  for  the  proportion  of  all 
students  at  the  college  who  are  female. 

Solution: 

The proportion of students in the sample who are female is

\[ \hat{p} = 69/120 = 0.575. \]

Confidence level 90% means that \(\alpha = 1 - 0.90 = 0.10\) so
\(\alpha/2 = 0.05\). From the last line of Figure 12.3 "Critical Values of " we
obtain \(z_{0.05} = 1.645\).

Thus

\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.575 \pm 1.645\sqrt{\frac{(0.575)(0.425)}{120}} = 0.575 \pm 0.074 \]

One may be 90% confident that the true proportion of all students at the
college who are female is contained in the interval

\[ (0.575 - 0.074,\ 0.575 + 0.074) = (0.501,\ 0.649). \]
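
The computation in Example 7 can be sketched in a few lines of Python. This is only an illustration, not part of the text's method: the function name `proportion_confidence_interval` is made up for this sketch, and the critical value is computed with the standard library's `NormalDist` instead of being read from the table, so it is not rounded to 1.645.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence):
    """Large-sample confidence interval for a population proportion.

    Hypothetical helper for illustration; returns (lower, upper).
    """
    p_hat = successes / n
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)

    # Large-sample check: p_hat +/- 3 standard errors must lie within [0, 1].
    if p_hat - 3 * std_err < 0 or p_hat + 3 * std_err > 1:
        raise ValueError("sample is not large enough for this formula")

    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    margin = z * std_err
    return p_hat - margin, p_hat + margin


# Example 7: 69 female students in a sample of 120, at 90% confidence.
low, high = proportion_confidence_interval(69, 120, 0.90)
print(f"({low:.3f}, {high:.3f})")
```

Rounded to three decimal places the endpoints agree with the interval found in Example 7. The check inside the function is the same "large sample" condition stated before the example, with \(\hat{p}\) substituted for \(p\).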


KEY  TAKEAWAYS 


•  We  have  a  single  formula  for  a  confidence  interval  for  a  population 
proportion,  which  is  valid  when  the  sample  is  large. 

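•  The condition that a sample be large is not that its size \(n\) be at least 30,
but that the interval \(\hat{p} \pm 3\sqrt{\hat{p}(1-\hat{p})/n}\) lie wholly within \([0, 1]\), so that
the approximating normal density essentially fits inside that interval.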
1.  Information  about  a  random  sample  is  given.  Verify  that  the  sample  is  large
enough  to  use  it  to  construct  a  confidence  interval  for  the  population 
proportion.  Then  construct  a  90%  confidence  interval  for  the  population 
proportion. 

a.  \(n = 25\), \(\hat{p} = 0.7\)

b.  \(n = 50\), \(\hat{p} = 0.7\)

2.  Information  about  a  random  sample  is  given.  Verify  that  the  sample  is  large 
enough  to  use  it  to  construct  a  confidence  interval  for  the  population 
proportion.  Then  construct  a  95%  confidence  interval  for  the  population 
proportion. 

a.  \(n = 2500\), \(\hat{p} = 0.22\)

b.  \(n = 1200\), \(\hat{p} = 0.22\)

3.  Information  about  a  random  sample  is  given.  Verify  that  the  sample  is  large 
enough  to  use  it  to  construct  a  confidence  interval  for  the  population 
proportion.  Then  construct  a  98%  confidence  interval  for  the  population 
proportion. 

a.  \(n = 80\), \(\hat{p} = 0.4\)

b.  \(n = 325\), \(\hat{p} = 0.4\)

4.  Information  about  a  random  sample  is  given.  Verify  that  the  sample  is  large 
enough  to  use  it  to  construct  a  confidence  interval  for  the  population 
proportion.  Then  construct  a  99.5%  confidence  interval  for  the  population 
proportion. 

a.  \(n = 200\), \(\hat{p} = 0.85\)

b.  \(n = 75\), \(\hat{p} = 0.85\)

5.  In  a  random  sample  of  size  1,100,  338  have  the  characteristic  of  interest. 

a.  Compute the sample proportion \(\hat{p}\) with the characteristic of interest.

b.  Verify  that  the  sample  is  large  enough  to  use  it  to  construct  a  confidence 
interval  for  the  population  proportion. 

c.  Construct  an  80%  confidence  interval  for  the  population  proportion  p. 

d.  Construct  a  90%  confidence  interval  for  the  population  proportion  p. 

e.  Comment  on  why  one  interval  is  longer  than  the  other. 
6.  In a random sample of size 2,400, 420 have the characteristic of interest.

a.  Compute the sample proportion \(\hat{p}\) with the characteristic of interest.

b.  Verify  that  the  sample  is  large  enough  to  use  it  to  construct  a  confidence 
interval  for  the  population  proportion. 

c.  Construct  a  90%  confidence  interval  for  the  population  proportion  p. 

d.  Construct  a  99%  confidence  interval  for  the  population  proportion  p. 

e.  Comment  on  why  one  interval  is  longer  than  the  other. 


APPLICATIONS 


7.  A  security  feature  on  some  web  pages  is  graphic  representations  of  words  that 
are  readable  by  human  beings  but  not  machines.  When  a  certain  design  format 
was  tested  on  450  subjects,  by  having  them  attempt  to  read  ten  disguised 
words,  448  subjects  could  read  all  the  words. 

a.  Give  a  point  estimate  of  the  proportion  p  of  all  people  who  could  read 
words  disguised  in  this  way. 

b.  Show  that  the  sample  is  not  sufficiently  large  to  construct  a  confidence 
interval  for  the  proportion  of  all  people  who  could  read  words  disguised  in 
this  way. 

8.  In  a  random  sample  of  900  adults,  42  defined  themselves  as  vegetarians. 

a.  Give  a  point  estimate  of  the  proportion  of  all  adults  who  would  define 
themselves  as  vegetarians. 

b.  Verify  that  the  sample  is  sufficiently  large  to  use  it  to  construct  a 
confidence  interval  for  that  proportion. 

c.  Construct  an  80%  confidence  interval  for  the  proportion  of  all  adults  who 
would  define  themselves  as  vegetarians. 

9.  In  a  random  sample  of  250  employed  people,  61  said  that  they  bring  work 
home  with  them  at  least  occasionally. 

a.  Give  a  point  estimate  of  the  proportion  of  all  employed  people  who  bring 
work  home  with  them  at  least  occasionally. 

b.  Construct  a  99%  confidence  interval  for  that  proportion. 

10.  In  a  random  sample  of  1,250  household  moves,  822  were  moves  to  a  location 
within  the  same  county  as  the  original  residence. 

a.  Give  a  point  estimate  of  the  proportion  of  all  household  moves  that  are  to  a 
location  within  the  same  county  as  the  original  residence. 

b.  Construct  a  98%  confidence  interval  for  that  proportion. 
11.  In  a  random  sample  of  12,447  hip  replacement  or  revision  surgery  procedures
nationwide,  162  patients  developed  a  surgical  site  infection. 

a.  Give  a  point  estimate  of  the  proportion  of  all  patients  undergoing  a  hip 
surgery  procedure  who  develop  a  surgical  site  infection. 

b.  Verify  that  the  sample  is  sufficiently  large  to  use  it  to  construct  a 
confidence  interval  for  that  proportion. 

c.  Construct  a  95%  confidence  interval  for  the  proportion  of  all  patients 
undergoing  a  hip  surgery  procedure  who  develop  a  surgical  site  infection. 

12.  In  a  certain  region  prepackaged  products  labeled  500  g  must  contain  on 
average  at  least  500  grams  of  the  product,  and  at  least  90%  of  all  packages  must 
weigh  at  least  490  grams.  In  a  random  sample  of  300  packages,  288  weighed  at 
least  490  grams. 

a.  Give  a  point  estimate  of  the  proportion  of  all  packages  that  weigh  at  least 
490  grams. 

b.  Verify  that  the  sample  is  sufficiently  large  to  use  it  to  construct  a 
confidence  interval  for  that  proportion. 

c.  Construct  a  99.8%  confidence  interval  for  the  proportion  of  all  packages 
that  weigh  at  least  490  grams. 

13.  A survey of 50 randomly selected adults in a small town asked them their
opinion on a proposed “no cruising” restriction late at night. Responses were
coded 1 for in favor, 0 for indifferent, and 2 for opposed, with the results
shown in the table.

1  0  2  0  1  0  0  1  1  2
0  2  0  0  0  1  0  2  0  0
0  2  1  2  0  0  0  2  0  1
0  2  0  2  0  1  0  0  2  0
1  0  0  1  2  0  0  2  1  2

a.  Give  a  point  estimate  of  the  proportion  of  all  adults  in  the  community  who 
are  indifferent  concerning  the  proposed  restriction. 

b.  Assuming  that  the  sample  is  sufficiently  large,  construct  a  90%  confidence 
interval  for  the  proportion  of  all  adults  in  the  community  who  are 
indifferent  concerning  the  proposed  restriction. 

14.  To  try  to  understand  the  reason  for  returned  goods,  the  manager  of  a  store 
examines  the  records  on  40  products  that  were  returned  in  the  last  year. 
Reasons  were  coded  by  1  for  “defective,”  2  for  “unsatisfactory,”  and  0  for  all 
other  reasons,  with  the  results  shown  in  the  table. 
0  2  0  0  0  0  0  2  0  0
0  0  0  0  0  0  0  0  0  2
0  0  2  0  0  0  0  2  0  0
0  0  0  0  0  1  0  0  0  0


a.  Give  a  point  estimate  of  the  proportion  of  all  returns  that  are  because  of 
something  wrong  with  the  product,  that  is,  either  defective  or  performed 
unsatisfactorily. 

b.  Assuming  that  the  sample  is  sufficiently  large,  construct  an  80% 
confidence  interval  for  the  proportion  of  all  returns  that  are  because  of 
something  wrong  with  the  product. 


15.  In  order  to  estimate  the  proportion  of  entering  students  who  graduate  within 
six  years,  the  administration  at  a  state  university  examined  the  records  of  600 
randomly  selected  students  who  entered  the  university  six  years  ago,  and 
found  that  312  had  graduated. 


a.  Give  a  point  estimate  of  the  six-year  graduation  rate,  the  proportion  of 
entering  students  who  graduate  within  six  years. 

b.  Assuming  that  the  sample  is  sufficiently  large,  construct  a  98%  confidence 
interval  for  the  six-year  graduation  rate. 

16.  In  a  random  sample  of  2,300  mortgages  taken  out  in  a  certain  region  last  year, 
187  were  adjustable-rate  mortgages. 


a.  Give  a  point  estimate  of  the  proportion  of  all  mortgages  taken  out  in  this 
region  last  year  that  were  adjustable-rate  mortgages. 

b.  Assuming  that  the  sample  is  sufficiently  large,  construct  a  99.9% 
confidence  interval  for  the  proportion  of  all  mortgages  taken  out  in  this 
region  last  year  that  were  adjustable-rate  mortgages. 

17.  In  a  research  study  in  cattle  breeding,  159  of  273  cows  in  several  herds  that 
were  in  estrus  were  detected  by  means  of  an  intensive  once  a  day,  one-hour 
observation  of  the  herds  in  early  morning. 


a.  Give  a  point  estimate  of  the  proportion  of  all  cattle  in  estrus  who  are 
detected  by  this  method. 

b.  Assuming  that  the  sample  is  sufficiently  large,  construct  a  90%  confidence 
interval  for  the  proportion  of  all  cattle  in  estrus  who  are  detected  by  this 
method. 


18.  A  survey  of  21,250  households  concerning  telephone  service  gave  the  results 
shown  in  the  table. 
                   Landline    No Landline
Cell phone           12,474          5,844
No cell phone         2,529            403

a.  Give  a  point  estimate  for  the  proportion  of  all  households  in  which  there  is 
a  cell  phone  but  no  landline. 

b.  Assuming  the  sample  is  sufficiently  large,  construct  a  99.9%  confidence 
interval  for  the  proportion  of  all  households  in  which  there  is  a  cell  phone 
but  no  landline. 

c.  Give  a  point  estimate  for  the  proportion  of  all  households  in  which  there  is 
no  telephone  service  of  either  kind. 

d.  Assuming the sample is sufficiently large, construct a 99.9% confidence
interval for the proportion of all households in which there is no
telephone service of either kind.


ADDITIONAL  EXERCISES 


19.  In  a  random  sample  of  900  adults,  42  defined  themselves  as  vegetarians.  Of 
these  42,  29  were  women. 

a.  Give  a  point  estimate  of  the  proportion  of  all  self-described  vegetarians 
who  are  women. 

b.  Verify  that  the  sample  is  sufficiently  large  to  use  it  to  construct  a 
confidence  interval  for  that  proportion. 

c.  Construct a 90% confidence interval for the proportion of all
self-described vegetarians who are women.

20.  A  random  sample  of  185  college  soccer  players  who  had  suffered  injuries  that 
resulted  in  loss  of  playing  time  was  made  with  the  results  shown  in  the  table. 
Injuries  are  classified  according  to  severity  of  the  injury  and  the  condition 
under  which  it  was  sustained. 


              Minor    Moderate    Serious
Practice         48          20          6
Game             62          32         17

a.  Give  a  point  estimate  for  the  proportion  p  of  all  injuries  to  college  soccer 
players  that  are  sustained  in  practice. 

b.  Construct  a  95%  confidence  interval  for  the  proportion  p  of  all  injuries  to 
college  soccer  players  that  are  sustained  in  practice. 

c.  Give  a  point  estimate  for  the  proportion  p  of  all  injuries  to  college  soccer 
players  that  are  either  moderate  or  serious. 
d.  Construct  a  95%  confidence  interval  for  the  proportion  p  of  all  injuries  to
college  soccer  players  that  are  either  moderate  or  serious. 

21.  The body mass index (BMI) was measured in 1,200 randomly selected adults,
with the results shown in the table.


BMI        Under 18.5    18.5-25    Over 25
Men                36        165        315
Women              75        274        335

a.  Give  a  point  estimate  for  the  proportion  of  all  men  whose  BMI  is  over  25. 

b.  Assuming  the  sample  is  sufficiently  large,  construct  a  99%  confidence 
interval  for  the  proportion  of  all  men  whose  BMI  is  over  25. 

c.  Give  a  point  estimate  for  the  proportion  of  all  adults,  regardless  of  gender, 
whose  BMI  is  over  25. 

d.  Assuming  the  sample  is  sufficiently  large,  construct  a  99%  confidence 
interval  for  the  proportion  of  all  adults,  regardless  of  gender,  whose  BMI  is 
over  25. 


22.  Confidence  intervals  constructed  using  the  formula  in  this  section  often  do  not 
do  as  well  as  expected  unless  n  is  quite  large,  especially  when  the  true 
population  proportion  is  close  to  either  0  or  1.  In  such  cases  a  better  result  is 
obtained  by  adding  two  successes  and  two  failures  to  the  actual  data  and  then 
computing  the  confidence  interval.  This  is  the  same  as  using  the  formula 


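\[ \tilde{p} \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}} \]

where

\[ \tilde{p} = \frac{x+2}{n+4} \quad\text{and}\quad \tilde{n} = n + 4 \]

and \(x\) denotes the number of successes in the original sample.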

Suppose  that  in  a  random  sample  of  600  households,  12  had  no  telephone 
service  of  any  kind.  Use  the  adjusted  confidence  interval  procedure  just 
described  to  form  a  99.9%  confidence  interval  for  the  proportion  of  all 
households  that  have  no  telephone  service  of  any  kind. 
LARGE  DATA  SET  EXERCISES


23.  Large  Data  Sets  4  and  4A  list  the  results  of  500  tosses  of  a  die.  Let  p  denote  the 
proportion  of  all  tosses  of  this  die  that  would  result  in  a  four.  Use  the  sample 
data  to  construct  a  90%  confidence  interval  for  p. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data4.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data4A.xls

24.  Large  Data  Set  6  records  results  of  a  random  survey  of  200  voters  in  each  of  two 
regions,  in  which  they  were  asked  to  express  whether  they  prefer  Candidate  A 
for  a  U.S.  Senate  seat  or  prefer  some  other  candidate.  Use  the  full  data  set  (400 
observations)  to  construct  a  98%  confidence  interval  for  the  proportion  p  of  all 
voters  who  prefer  Candidate  A. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data6.xls 

25.  Lines 2 through 536 in Large Data Set 11 are a sample of 535 real estate sales in a
certain region in 2008. Those that were foreclosure sales are identified with a 1
in the second column.

http://www.gone.2012books.lardbucket.org/sites/all/files/data11.xls

a.  Use these data to construct a point estimate \(\hat{p}\) of the proportion \(p\) of all
real estate sales in this region in 2008 that were foreclosure sales.

b.  Use these data to construct a 90% confidence interval for \(p\).

26.  Lines 537 through 1106 in Large Data Set 11 are a sample of 570 real estate sales
in a certain region in 2010. Those that were foreclosure sales are identified
with a 1 in the second column.

http://www.gone.2012books.lardbucket.org/sites/all/files/data11.xls

a.  Use these data to construct a point estimate \(\hat{p}\) of the proportion \(p\) of all
real estate sales in this region in 2010 that were foreclosure sales.

b.  Use these data to construct a 90% confidence interval for \(p\).
ANSWERS


1.  a.  (0.5492, 0.8508)

b.  (0.5934, 0.8066)

3.  a.  (0.2726, 0.5274)

b.  (0.3368, 0.4632)

5.  a.  0.3073

b.  \(\hat{p} \pm 3\sqrt{\hat{p}(1-\hat{p})/n} = 0.31 \pm 0.04\) and \([0.27, 0.35] \subset [0, 1]\)

c.  (0.2895, 0.3251)

d.  (0.2844, 0.3302)

e.  Asking for greater confidence requires a longer interval.

7.  a.  0.9956

b.  (0.9862, 1.005)

9.  a.  0.244

b.  (0.1740, 0.3140)

11.  a.  0.013

b.  (0.01, 0.016)

c.  (0.011, 0.015)

13.  a.  0.52

b.  (0.4038, 0.6362)

15.  a.  0.52

b.  (0.4726, 0.5674)

17.  a.  0.5824

b.  (0.5333, 0.6315)

19.  a.  0.69

b.  \(\hat{p} \pm 3\sqrt{\hat{p}(1-\hat{p})/n} = 0.69 \pm 0.21\) and \([0.48, 0.90] \subset [0, 1]\)

c.  0.69 ± 0.12
21.  a.  0.6105

b.  (0.5552,0.6658) 

c.  0.5583 

d.  (0.5214,0.5952) 


23.  (0.1368,0.1912) 

25.  a.  \(\hat{p} = 0.2280\)

b.  (0.1982,0.2579) 
7.4  Sample  Size  Considerations


LEARNING  OBJECTIVE 

1.  To  learn  how  to  apply  formulas  for  estimating  the  size  sample  that  will 
be  needed  in  order  to  construct  a  confidence  interval  for  a  population 
mean  or  proportion  that  meets  given  criteria. 


Sampling  is  typically  done  with  a  set  of  clear  objectives  in  mind.  For  example,  an 
economist  might  wish  to  estimate  the  mean  yearly  income  of  workers  in  a 
particular  industry  at  90%  confidence  and  to  within  $500.  Since  sampling  costs  time, 
effort,  and  money,  it  would  be  useful  to  be  able  to  estimate  the  smallest  size  sample 
that  is  likely  to  meet  these  criteria. 

Estimating μ

The confidence interval formulas for estimating a population mean \(\mu\) have the form
\(\bar{x} \pm E\). When the population standard deviation \(\sigma\) is known,

\[ E = \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}} \]


The number \(z_{\alpha/2}\) is determined by the desired level of confidence. To say that we
wish  to  estimate  the  mean  to  within  a  certain  number  of  units  means  that  we  want 
the  margin  of  error  E  to  be  no  larger  than  that  number.  Thus  we  obtain  the 
minimum  sample  size  needed  by  solving  the  displayed  equation  for  n. 
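
The algebra behind "solving for \(n\)" is short. As a sketch, starting from the margin of error \(E = z_{\alpha/2}\,\sigma/\sqrt{n}\) for the case where \(\sigma\) is known:

```latex
\[
E = \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}
\;\Longrightarrow\;
\sqrt{n} = \frac{z_{\alpha/2}\,\sigma}{E}
\;\Longrightarrow\;
n = \frac{(z_{\alpha/2})^{2}\,\sigma^{2}}{E^{2}}
\]
```

Because the margin of error shrinks as \(n\) grows, any whole number at least this large keeps the margin no larger than \(E\), which is why the value is rounded up.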


Minimum  Sample  Size  for  Estimating  a  Population  Mean 


The estimated minimum sample size \(n\) needed to estimate a population mean \(\mu\)
to within \(E\) units at \(100(1-\alpha)\%\) confidence is

\[ n = \frac{(z_{\alpha/2})^{2}\,\sigma^{2}}{E^{2}} \quad \text{(rounded up)} \]
To apply the formula we must have prior knowledge of the population in order to
have an estimate of its standard deviation \(\sigma\). In all the examples and exercises the
population standard deviation will be given.


EXAMPLE  8 


Find the minimum sample size necessary to construct a 99% confidence
interval for \(\mu\) with a margin of error \(E = 0.2\). Assume that the population
standard deviation is \(\sigma = 1.3\).


Solution: 


Confidence level 99% means that \(\alpha = 1 - 0.99 = 0.01\) so
\(\alpha/2 = 0.005\). From the last line of Figure 12.3 "Critical Values of " we
obtain \(z_{0.005} = 2.576\). Thus

\[ n = \frac{(z_{\alpha/2})^{2}\,\sigma^{2}}{E^{2}} = \frac{(2.576)^{2}(1.3)^{2}}{(0.2)^{2}} = 280.361536 \]

which we round up to 281, since it is impossible to take a fractional
observation.
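
For readers who want to check such computations by machine, here is a minimal Python sketch of the sample-size formula for a mean. The function name `min_sample_size_mean` is hypothetical, and the critical value comes from the standard library's `NormalDist` and is not rounded to three decimals as in the table, so in borderline cases the result could differ by one from a hand computation.

```python
import math
from statistics import NormalDist


def min_sample_size_mean(sigma, margin, confidence):
    """Estimated minimum n to estimate a mean to within `margin`.

    Hypothetical helper for illustration: n = (z * sigma / E)^2, rounded up.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    return math.ceil((z * sigma / margin) ** 2)


# Example 8: sigma = 1.3, E = 0.2, 99% confidence.
print(min_sample_size_mean(1.3, 0.2, 0.99))
```

With the inputs of Example 8 the sketch returns 281, the same value obtained above.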
EXAMPLE  9


An economist wishes to estimate, with a 95% confidence interval, the mean yearly
income of welders with at least five years' experience to within $1,000. He
estimates that the range of incomes is no more than $24,000, so using the
Empirical Rule he estimates the population standard deviation to be about
one-sixth as much, or about $4,000. Find the estimated minimum sample
size required.

Solution: 


Confidence level 95% means that \(\alpha = 1 - 0.95 = 0.05\) so
\(\alpha/2 = 0.025\). From the last line of Figure 12.3 "Critical Values of " we
obtain \(z_{0.025} = 1.960\).


To say that the estimate is to be “to within $1,000” means that \(E = 1000\). Thus

\[ n = \frac{(z_{\alpha/2})^{2}\,\sigma^{2}}{E^{2}} = \frac{(1.960)^{2}(4000)^{2}}{(1000)^{2}} = 61.4656 \]

which we round up to 62.


Estimating  p 

The confidence interval formula for estimating a population proportion \(p\) is \(\hat{p} \pm E\),
where

\[ E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

The number \(z_{\alpha/2}\) is determined by the desired level of confidence. To say that we
wish to estimate the population proportion to within a certain number of
percentage points means that we want the margin of error \(E\) to be no larger than
that number (expressed as a proportion). Thus we obtain the minimum sample size
needed by solving the displayed equation for \(n\).
Minimum Sample Size for Estimating a Population
Proportion


The estimated minimum sample size \(n\) needed to estimate a population
proportion \(p\) to within \(E\) at \(100(1-\alpha)\%\) confidence is

\[ n = \frac{(z_{\alpha/2})^{2}\,\hat{p}(1-\hat{p})}{E^{2}} \quad \text{(rounded up)} \]


There is a dilemma here: the formula for estimating how large a sample to take
contains the number \(\hat{p}\), which we know only after we have taken the sample. There
are two ways out of this dilemma. Typically the researcher will have some idea as to
the value of the population proportion \(p\), hence of what the sample proportion \(\hat{p}\) is
likely to be. For example, if last month 37% of all voters thought that state taxes are
too high, then it is likely that the proportion with that opinion this month will not
be dramatically different, and we would use the value 0.37 for \(\hat{p}\) in the formula.


The second approach to resolving the dilemma is simply to replace \(\hat{p}\) in the formula
by 0.5. This is because if \(\hat{p}\) is large then \(1 - \hat{p}\) is small, and vice versa, which limits
their product to a maximum value of 0.25, which occurs when \(\hat{p} = 0.5\). This is
called the most conservative estimate⁶, since it gives the largest possible estimate
of \(n\).
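
The bound of 0.25 on the product can be verified by completing the square. The identity below is a standard algebraic fact, given here as a supplementary check (it is not taken from the text):

```latex
\[
\hat{p}\,(1-\hat{p}) \;=\; \frac{1}{4} - \Bigl(\hat{p} - \frac{1}{2}\Bigr)^{2} \;\le\; \frac{1}{4}
\]
```

Equality holds exactly when \(\hat{p} = 1/2\), so substituting 0.5 makes the numerator of the sample-size formula as large as it can be.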


6.  The estimate obtained using
\(\hat{p} = 0.5\), which gives the
largest estimate of \(n\).
EXAMPLE  10


Find  the  necessary  minimum  sample  size  to  construct  a  98%  confidence 
interval  for  p  with  a  margin  of  error  E  =  0.05, 

a.  assuming  that  no  prior  knowledge  about  p  is  available;  and 

b.  assuming  that  prior  studies  suggest  that  p  is  about  0.1. 

Solution: 

Confidence level 98% means that \(\alpha = 1 - 0.98 = 0.02\) so
\(\alpha/2 = 0.01\). From the last line of Figure 12.3 "Critical Values of" we
obtain \(z_{0.01} = 2.326\).


a.  Since there is no prior knowledge of \(p\) we make the most
conservative estimate that \(\hat{p} = 0.5\). Then

\[ n = \frac{(z_{\alpha/2})^{2}\,\hat{p}(1-\hat{p})}{E^{2}} = \frac{(2.326)^{2}(0.5)(1-0.5)}{0.05^{2}} = 541.0276 \]

which we round up to 542.


b.  Since \(p \approx 0.1\) we estimate \(\hat{p}\) by 0.1, and obtain

\[ n = \frac{(z_{\alpha/2})^{2}\,\hat{p}(1-\hat{p})}{E^{2}} = \frac{(2.326)^{2}(0.1)(1-0.1)}{0.05^{2}} = 194.769936 \]

which we round up to 195.
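
Both parts of Example 10 can be checked with a short Python sketch. The function name `min_sample_size_proportion` and its `p_guess` parameter are hypothetical, introduced only for this illustration. The default of 0.5 corresponds to the most conservative estimate, and the critical value is computed exactly instead of being taken from the table.

```python
import math
from statistics import NormalDist


def min_sample_size_proportion(margin, confidence, p_guess=0.5):
    """Estimated minimum n to estimate a proportion to within `margin`.

    Hypothetical helper for illustration. `p_guess` is the prior guess
    for the sample proportion; 0.5 is the most conservative choice.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)


# Example 10(a): no prior knowledge of p.
print(min_sample_size_proportion(0.05, 0.98))
# Example 10(b): prior studies suggest p is about 0.1.
print(min_sample_size_proportion(0.05, 0.98, p_guess=0.1))
```

The two calls return 542 and 195, matching parts (a) and (b). The gap between them shows how much a reasonable prior guess for \(p\) can reduce the required sample.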
EXAMPLE  11


A  dermatologist  wishes  to  estimate  the  proportion  of  young  adults  who 
apply  sunscreen  regularly  before  going  out  in  the  sun  in  the  summer.  Find 
the  minimum  sample  size  required  to  estimate  the  proportion  to  within 
three  percentage  points,  at  90%  confidence. 

Solution: 

Confidence level 90% means that \(\alpha = 1 - 0.90 = 0.10\) so
\(\alpha/2 = 0.05\). From the last line of Figure 12.3 "Critical Values of " we
obtain \(z_{0.05} = 1.645\).

Since there is no prior knowledge of \(p\) we make the most conservative
estimate that \(\hat{p} = 0.5\). To estimate “to within three percentage points”
means that \(E = 0.03\). Then

\[ n = \frac{(z_{\alpha/2})^{2}\,\hat{p}(1-\hat{p})}{E^{2}} = \frac{(1.645)^{2}(0.5)(1-0.5)}{0.03^{2}} = 751.6736111 \]

which we round up to 752.

KEY  TAKEAWAYS 


•  If the population standard deviation \(\sigma\) is known or can be estimated,
then  the  minimum  sample  size  needed  to  obtain  a  confidence  interval 
for  the  population  mean  with  a  given  maximum  error  of  the  estimate 
and  a  given  level  of  confidence  can  be  estimated. 

•  The minimum sample size needed to obtain a confidence interval for a
population proportion with a given maximum error of the estimate and
a given level of confidence can always be estimated. If there is prior
knowledge of the population proportion \(p\) then the estimate can be
sharpened.
1.  Estimate  the  minimum  sample  size  needed  to  form  a  confidence  interval  for
the  mean  of  a  population  having  the  standard  deviation  shown,  meeting  the 
criteria  given. 

a.  \(\sigma = 30\), 95% confidence, \(E = 10\)

b.  \(\sigma = 30\), 99% confidence, \(E = 10\)

c.  \(\sigma = 30\), 95% confidence, \(E = 5\)

2.  Estimate  the  minimum  sample  size  needed  to  form  a  confidence  interval  for 
the  mean  of  a  population  having  the  standard  deviation  shown,  meeting  the 
criteria  given. 

a.  \(\sigma = 4\), 95% confidence, \(E = 1\)

b.  \(\sigma = 4\), 99% confidence, \(E = 1\)

c.  \(\sigma = 4\), 95% confidence, \(E = 0.5\)

3.  Estimate  the  minimum  sample  size  needed  to  form  a  confidence  interval  for 
the  proportion  of  a  population  that  has  a  particular  characteristic,  meeting  the 
criteria  given. 

a.  \(p \approx 0.37\), 80% confidence, \(E = 0.05\)

b.  \(p \approx 0.37\), 90% confidence, \(E = 0.05\)

c.  \(p \approx 0.37\), 80% confidence, \(E = 0.01\)

4.  Estimate  the  minimum  sample  size  needed  to  form  a  confidence  interval  for 
the  proportion  of  a  population  that  has  a  particular  characteristic,  meeting  the 
criteria  given. 

a.  p  =  0.81,  95%  confidence,  E  =  0.02 

b.  p  =  0.81,  99%  confidence,  E  =  0.02 

c.  p  =  0.81,  95%  confidence,  E  =  0.01 

5.  Estimate  the  minimum  sample  size  needed  to  form  a  confidence  interval  for 
the  proportion  of  a  population  that  has  a  particular  characteristic,  meeting  the 
criteria  given. 

a.  80%  confidence,  E  =  0.05 

b.  90%  confidence,  E  =  0.05 

c.  80%  confidence,  E  =  0.01 
6.  Estimate  the  minimum  sample  size  needed  to  form  a  confidence  interval  for
the  proportion  of  a  population  that  has  a  particular  characteristic,  meeting  the 
criteria  given. 

a.  95%  confidence,  E  =  0.02 

b.  99%  confidence,  E  =  0.02 

c.  95%  confidence,  E  =  0.01 


APPLICATIONS 


7.  A  software  engineer  wishes  to  estimate,  to  within  5  seconds,  the  mean  time 
that  a  new  application  takes  to  start  up,  with  95%  confidence.  Estimate  the 
minimum  size  sample  required  if  the  standard  deviation  of  start  up  times  for 
similar  software  is  12  seconds. 

8.  A  real  estate  agent  wishes  to  estimate,  to  within  $2.50,  the  mean  retail  cost  per 
square  foot  of  newly  built  homes,  with  80%  confidence.  He  estimates  the 
standard  deviation  of  such  costs  at  $5.00.  Estimate  the  minimum  size  sample 
required. 

9.  An  economist  wishes  to  estimate,  to  within  2  minutes,  the  mean  time  that 
employed  persons  spend  commuting  each  day,  with  95%  confidence.  On  the 
assumption  that  the  standard  deviation  of  commuting  times  is  8  minutes, 
estimate  the  minimum  size  sample  required. 

10.  A motor club wishes to estimate, to within 1 cent, the mean price of 1 gallon of
regular gasoline in a certain region, with 98% confidence. Historically the
variability of prices is measured by \(\sigma\) = $0.03. Estimate the minimum size
sample required.

11.  A bank wishes to estimate, to within $25, the mean average monthly balance in
its checking accounts, with 99.8% confidence. Assuming \(\sigma\) = $250, estimate
the minimum size sample required.

12.  A  retailer  wishes  to  estimate,  to  within  15  seconds,  the  mean  duration  of 
telephone  orders  taken  at  its  call  center,  with  99.5%  confidence.  In  the  past  the 
standard  deviation  of  call  length  has  been  about  1.25  minutes.  Estimate  the 
minimum  size  sample  required.  (Be  careful  to  express  all  the  information  in 
the  same  units.) 

13.  The  administration  at  a  college  wishes  to  estimate,  to  within  two  percentage 
points,  the  proportion  of  all  its  entering  freshmen  who  graduate  within  four 
years,  with  90%  confidence.  Estimate  the  minimum  size  sample  required. 
14.  A  chain  of  automotive  repair  stores  wishes  to  estimate,  to  within  five
percentage  points,  the  proportion  of  all  passenger  vehicles  in  operation  that 
are  at  least  five  years  old,  with  98%  confidence.  Estimate  the  minimum  size 
sample  required. 

15.  An  internet  service  provider  wishes  to  estimate,  to  within  one  percentage 
point,  the  current  proportion  of  all  email  that  is  spam,  with  99.9%  confidence. 
Last  year  the  proportion  that  was  spam  was  71%.  Estimate  the  minimum  size 
sample  required. 

16.  An  agronomist  wishes  to  estimate,  to  within  one  percentage  point,  the 
proportion  of  a  new  variety  of  seed  that  will  germinate  when  planted,  with  95% 
confidence.  A  typical  germination  rate  is  97%.  Estimate  the  minimum  size 
sample  required. 

17.  A  charitable  organization  wishes  to  estimate,  to  within  half  a  percentage  point, 
the  proportion  of  all  telephone  solicitations  to  its  donors  that  result  in  a  gift, 
with  90%  confidence.  Estimate  the  minimum  sample  size  required,  using  the 
information  that  in  the  past  the  response  rate  has  been  about  30%. 

18.  A  government  agency  wishes  to  estimate  the  proportion  of  drivers  aged  16-24 
who  have  been  involved  in  a  traffic  accident  in  the  last  year.  It  wishes  to  make 
the  estimate  to  within  one  percentage  point  and  at  90%  confidence.  Find  the 
minimum  sample  size  required,  using  the  information  that  several  years  ago 
the  proportion  was  0.12. 


ADDITIONAL  EXERCISES 


19.  An economist wishes to estimate, to within six months, the mean time between
sales of existing homes, with 95% confidence. Estimate the minimum size
sample required. In his experience virtually all houses are re-sold within 40
months, so using the Empirical Rule he will estimate \(\sigma\) by one-sixth the range,
or 40/6 ≈ 6.7.

20.  A wildlife manager wishes to estimate the mean length of fish in a large lake,
to within one inch, with 80% confidence. Estimate the minimum size sample
required. In his experience virtually no fish caught in the lake is over 23 inches
long, so using the Empirical Rule he will estimate \(\sigma\) by one-sixth the range, or
23/6 ≈ 3.8.

21.  You wish to estimate the current mean birth weight of all newborns in a
certain region, to within 1 ounce (1/16 pound) and with 95% confidence. A
sample will cost $400 plus $1.50 for every newborn weighed. You believe the
standard deviation of weights to be no more than 1.25 pounds. You have $2,500
to spend on the study.

a.  Can you afford the sample required?

b.  If not, what are your options?

22.  You  wish  to  estimate  a  population  proportion  to  within  three  percentage 
points,  at  95%  confidence.  A  sample  will  cost  $500  plus  50  cents  for  every 
sample  element  measured.  You  have  $1,000  to  spend  on  the  study. 

a.  Can  you  afford  the  sample  required? 

b.  If  not,  what  are  your  options? 


ANSWERS 


1. a. 35
   b. 60
   c. 139

3. a. 154
   b. 253
   c. 3832

5. a. 165
   b. 271
   c. 4109

7. 23

9. 62

11. 955

13. 1692

15. 22,301

17. 22,731

19. 5

21. a. no
    b. decrease the confidence level


Chapter 8 Testing Hypotheses


A  manufacturer  of  emergency  equipment  asserts  that  a  respirator  that  it  makes 
delivers  pure  air  for  75  minutes  on  average.  A  government  regulatory  agency  is 
charged  with  testing  such  claims,  in  this  case  to  verify  that  the  average  time  is  not 
less  than  75  minutes.  To  do  so  it  would  select  a  random  sample  of  respirators, 
compute  the  mean  time  that  they  deliver  pure  air,  and  compare  that  mean  to  the 
asserted  time  75  minutes. 

In the sampling that we have studied so far the goal has been to estimate a
population parameter. But the sampling done by the government agency has a
somewhat different objective: not so much to estimate the population mean \(\mu\)
as to test an assertion, or a hypothesis¹, about it, namely, whether it is as
large as 75 or not. The agency is not necessarily interested in the actual value
of \(\mu\), just whether it is as claimed. Their sampling is done to perform a test
of hypotheses, the subject of this chapter.


1.  A  statement  about  a  population 
parameter. 
8.1 The Elements of Hypothesis Testing


LEARNING  OBJECTIVES 

1.  To  understand  the  logical  framework  of  tests  of  hypotheses. 

2.  To  learn  basic  terminology  connected  with  hypothesis  testing. 

3.  To  learn  fundamental  facts  about  hypothesis  testing. 


Types  of  Hypotheses 

A  hypothesis  about  the  value  of  a  population  parameter  is  an  assertion  about  its 
value.  As  in  the  introductory  example  we  will  be  concerned  with  testing  the  truth  of 
two  competing  hypotheses,  only  one  of  which  can  be  true. 


2. Null hypothesis (\(H_0\)): the statement that is assumed to be true unless
there is convincing evidence to the contrary.

3. Alternative hypothesis (\(H_a\)): a statement that is accepted as true only
if there is convincing evidence in favor of it.

4. Hypothesis testing: a statistical procedure in which a choice is made
between a null hypothesis and a specific alternative hypothesis based on
information in a sample.


The end result of a hypothesis testing procedure is a choice of one of the
following two possible conclusions:

1. Reject \(H_0\) (and therefore accept \(H_a\)), or

2. Fail to reject \(H_0\) (and therefore fail to accept \(H_a\)).


The null hypothesis typically represents the status quo, or what has historically
been true. In the example of the respirators, we would believe the claim of the
manufacturer unless there is reason not to do so, so the null hypothesis is
\(H_0: \mu = 75\). The alternative hypothesis in the example is the contradictory
statement \(H_a: \mu < 75\). The null hypothesis will always be an assertion
containing an equals sign, but depending on the situation the alternative
hypothesis can have any one of three forms: with the symbol “<,” as in the
example just discussed, with the symbol “>,” or with the symbol “≠.” The
following two examples illustrate the latter two cases.


EXAMPLE  1 


A  publisher  of  college  textbooks  claims  that  the  average  price  of  all 
hardbound  college  textbooks  is  $127.50.  A  student  group  believes  that  the 
actual  mean  is  higher  and  wishes  to  test  their  belief.  State  the  relevant  null 
and  alternative  hypotheses. 

Solution: 

The default option is to accept the publisher’s claim unless there is
compelling evidence to the contrary. Thus the null hypothesis is
\(H_0: \mu = 127.50\). Since the student group thinks that the average
textbook price is greater than the publisher’s figure, the alternative
hypothesis in this situation is \(H_a: \mu > 127.50\).
EXAMPLE 2


The recipe for a bakery item is designed to result in a product that contains
8 grams of fat per serving. The quality control department samples the
product periodically to ensure that the production process is working as
designed. State the relevant null and alternative hypotheses.

Solution:

The default option is to assume that the product contains the amount of fat
it was formulated to contain unless there is compelling evidence to the
contrary. Thus the null hypothesis is \(H_0: \mu = 8.0\). Since containing
either more fat than desired or less fat than desired is an indication of a
faulty production process, the alternative hypothesis in this situation is
that the mean is different from 8.0, so \(H_a: \mu \neq 8.0\).


In Note 8.8 "Example 1", the textbook example, it might seem more natural that the
publisher’s claim be that the average price is at most $127.50, not exactly
$127.50. If the claim were made this way, then the null hypothesis would be
\(H_0: \mu \le 127.50\), and the value $127.50 given in the example would be the
one that is least favorable to the publisher’s claim, the null hypothesis. It is
always true that if the null hypothesis is retained for its least favorable
value, then it is retained for every other value.


Thus  in  order  to  make  the  null  and  alternative  hypotheses  easy  for  the  student  to 
distinguish,  in  every  example  and  problem  in  this  text  we  will  always  present  one  of 
the  two  competing  claims  about  the  value  of  a  parameter  with  an  equality.  The  claim 
expressed  with  an  equality  is  the  null  hypothesis.  This  is  the  same  as  always  stating  the 
null  hypothesis  in  the  least  favorable  light.  So  in  the  introductory  example  about 
the  respirators,  we  stated  the  manufacturer’s  claim  as  “the  average  is  75  minutes” 
instead  of  the  perhaps  more  natural  “the  average  is  at  least  75  minutes,”  essentially 
reducing  the  presentation  of  the  null  hypothesis  to  its  worst  case. 


The  first  step  in  hypothesis  testing  is  to  identify  the  null  and  alternative 
hypotheses. 

The  Logic  of  Hypothesis  Testing 

Although we will study hypothesis testing in situations other than for a single
population mean (for example, for a population proportion instead of a mean or in
comparing the means of two different populations), in this section the discussion
will always be given in terms of a single population mean \(\mu\).

The null hypothesis always has the form \(H_0: \mu = \mu_0\) for a specific number
\(\mu_0\) (in the respirator example \(\mu_0 = 75\), in the textbook example
\(\mu_0 = 127.50\), and in the baked goods example \(\mu_0 = 8.0\)). Since the null
hypothesis is accepted unless there is strong evidence to the contrary, the test
procedure is based on the initial assumption that \(H_0\) is true. This point is so
important that we will repeat it in a display:


The test procedure is based on the initial assumption that \(H_0\) is true.


The criterion for judging between \(H_0\) and \(H_a\) based on the sample data is: if
the value of \(\bar{X}\) would be highly unlikely to occur if \(H_0\) were true, but
favors the truth of \(H_a\), then we reject \(H_0\) in favor of \(H_a\). Otherwise we
do not reject \(H_0\).

Supposing for now that \(\bar{X}\) follows a normal distribution, when the null
hypothesis is true the density function for the sample mean \(\bar{X}\) must be as
in Figure 8.1 "The Density Curve for \(\bar{X}\) if \(H_0\) Is True": a bell curve
centered at \(\mu_0\). Thus if \(H_0\) is true then \(\bar{X}\) is likely to take a
value near \(\mu_0\) and is unlikely to take values far away. Our decision procedure
therefore reduces simply to:

1. if \(H_a\) has the form \(H_a: \mu < \mu_0\) then reject \(H_0\) if \(\bar{x}\) is
far to the left of \(\mu_0\);

2. if \(H_a\) has the form \(H_a: \mu > \mu_0\) then reject \(H_0\) if \(\bar{x}\) is
far to the right of \(\mu_0\);

3. if \(H_a\) has the form \(H_a: \mu \neq \mu_0\) then reject \(H_0\) if \(\bar{x}\)
is far away from \(\mu_0\) in either direction.
5. An interval or union of
intervals  such  that  the  null 
hypothesis  is  rejected  if  and 
only  if  the  statistic  of  interest 
lies  in  this  region. 


Figure 8.1 The Density Curve for \(\bar{X}\) if \(H_0\) Is True


Think of the respirator example, for which the null hypothesis is
\(H_0: \mu = 75\), the claim that the average time air is delivered for all
respirators is 75 minutes. If the sample mean is 75 or greater then we certainly
would not reject \(H_0\) (since there is no issue with an emergency respirator
delivering air even longer than claimed).


If the sample mean is slightly less than 75 then we would logically attribute the
difference to sampling error and also not reject \(H_0\).


Values of the sample mean that are smaller and smaller are less and less likely to
come from a population for which the population mean is 75. Thus if the sample
mean is far less than 75, say around 60 minutes or less, then we would certainly
reject \(H_0\), because we know that it is highly unlikely that the average of a
sample would be so low if the population mean were 75. This is the rare event
criterion for rejection: what we actually observed (\(\bar{X} < 60\)) would be so
rare an event if \(\mu = 75\) were true that we regard it as much more likely that
the alternative hypothesis \(\mu < 75\) holds.


In summary, to decide between \(H_0\) and \(H_a\) in this example we would select a
“rejection region”⁵ of values sufficiently far to the left of 75, based on the
rare event criterion, and reject \(H_0\) if the sample mean \(\bar{X}\) lies in the
rejection region, but not reject \(H_0\) if it does not.
The Rejection Region

Each different form of the alternative hypothesis \(H_a\) has its own kind of
rejection region:


1. if (as in the respirator example) \(H_a\) has the form \(H_a: \mu < \mu_0\), we
reject \(H_0\) if \(\bar{x}\) is far to the left of \(\mu_0\), that is, to the left
of some number \(C\), so the rejection region has the form of an interval
\((-\infty, C]\);

2. if (as in the textbook example) \(H_a\) has the form \(H_a: \mu > \mu_0\), we
reject \(H_0\) if \(\bar{x}\) is far to the right of \(\mu_0\), that is, to the right
of some number \(C\), so the rejection region has the form of an interval
\([C, \infty)\);

3. if (as in the baked goods example) \(H_a\) has the form \(H_a: \mu \neq \mu_0\),
we reject \(H_0\) if \(\bar{x}\) is far away from \(\mu_0\) in either direction, that
is, either to the left of some number \(C\) or to the right of some other number
\(C'\), so the rejection region has the form of the union of two intervals
\((-\infty, C] \cup [C', \infty)\).


The  key  issue  in  our  line  of  reasoning  is  the  question  of  how  to  determine  the 
number  C  or  numbers  C  and  C',  called  the  critical  value  or  critical  values  of  the 
statistic,  that  determine  the  rejection  region. 


Definition 

The critical value⁶ or critical values of a test of hypotheses are the number or
numbers that determine the rejection region.


6.  The  number  or  one  of  a  pair  of 
numbers  that  determines  the 
rejection  region. 


Suppose the rejection region is a single interval, so we need to select a single
number \(C\). Here is the procedure for doing so. We select a small probability,
denoted \(\alpha\), say 1%, which we take as our definition of “rare event”: an
event is “rare” if its probability of occurrence is less than \(\alpha\). (In all
the examples and problems in this text the value of \(\alpha\) will be given
already.) The probability that \(\bar{X}\) takes a value in an interval is the area
under its density curve and above that interval, so as shown in Figure 8.2 (drawn
under the assumption that \(H_0\) is true, so that the curve centers at \(\mu_0\))
the critical value \(C\) is the value of \(\bar{X}\) that cuts off a tail area
\(\alpha\) in the probability density curve of \(\bar{X}\). When the rejection
region is in two pieces, that is, composed of two intervals, the total area above
both of them must be \(\alpha\), so the area above each one is \(\alpha/2\), as also
shown in Figure 8.2.
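
The selection of the critical value or values can be sketched in a few lines of Python. This is only an illustration under the assumptions of this section (a normally distributed sample mean and a known population standard deviation); the function name `critical_values` and the numbers passed to it (a standard deviation of 9 and a sample of size 36) are hypothetical and do not come from the text.

```python
from math import sqrt
from statistics import NormalDist


def critical_values(mu0, sigma, n, alpha, alternative):
    """Critical value(s) of the sample mean for a test of H0: mu = mu0.

    alternative is "<", ">" or "!=" (the symbol in Ha).
    Assumes the sample mean is normal with standard deviation sigma / sqrt(n).
    """
    xbar_dist = NormalDist(mu=mu0, sigma=sigma / sqrt(n))
    if alternative == "<":
        return (xbar_dist.inv_cdf(alpha),)        # cuts off a left tail of area alpha
    if alternative == ">":
        return (xbar_dist.inv_cdf(1 - alpha),)    # cuts off a right tail of area alpha
    if alternative == "!=":
        # two tails, each of area alpha / 2
        return (xbar_dist.inv_cdf(alpha / 2), xbar_dist.inv_cdf(1 - alpha / 2))
    raise ValueError("alternative must be '<', '>' or '!='")


# Hypothetical inputs for a left-tailed test of H0: mu = 75 at alpha = 0.01.
(c,) = critical_values(mu0=75, sigma=9, n=36, alpha=0.01, alternative="<")
print(round(c, 2))
```

With these made-up inputs the rejection region would be all sample means at or below the printed value.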
Figure 8.2

The number \(\alpha\) is the total area of a tail or a pair of tails.




EXAMPLE  3 


In the context of Note 8.9 "Example 2", suppose that it is known that the
population is normally distributed with standard deviation \(\sigma = 0.15\) gram,
and suppose that the test of hypotheses \(H_0: \mu = 8.0\) versus
\(H_a: \mu \neq 8.0\) will be performed with a sample of size 5. Construct the
rejection region for the test for the choice \(\alpha = 0.10\). Explain the decision
procedure and interpret it.

Solution:

If \(H_0\) is true then the sample mean \(\bar{X}\) is normally distributed with mean
and standard deviation

\[ \mu_{\bar{X}} = \mu = 8.0, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.15}{\sqrt{5}} = 0.067 \]

Since \(H_a\) contains the \(\neq\) symbol the rejection region will be in two
pieces, each one corresponding to a tail of area \(\alpha/2 = 0.10/2 = 0.05\).

From Figure 12.3 "Critical Values of", \(z_{0.05} = 1.645\), so \(C\) and \(C'\) are
1.645 standard deviations of \(\bar{X}\) to the left and right of its mean 8.0:

\[ C = 8.0 - (1.645)(0.067) = 7.89 \quad \text{and} \quad C' = 8.0 + (1.645)(0.067) = 8.11 \]

The result is shown in Figure 8.3 "Rejection Region for the Choice
\(\alpha = 0.10\)".


Figure 8.3 Rejection Region for the Choice \(\alpha = 0.10\)
(\(H_a: \mu \neq 8.0\))


The decision procedure is: take a sample of size 5 and compute the sample
mean \(\bar{x}\). If \(\bar{x}\) is either 7.89 grams or less or 8.11 grams or more
then reject the hypothesis that the average amount of fat in all servings of the
product is 8.0 grams in favor of the alternative that it is different from 8.0
grams. Otherwise do not reject the hypothesis that the average amount is 8.0
grams.

The  reasoning  is  that  if  the  true  average  amount  of  fat  per  serving  were  8.0 
grams  then  there  would  be  less  than  a  10%  chance  that  a  sample  of  size  5 
would  produce  a  mean  of  either  7.89  grams  or  less  or  8.11  grams  or  more. 
Hence  if  that  happened  it  would  be  more  likely  that  the  value  8.0  is  incorrect 
(always  assuming  that  the  population  standard  deviation  is  0.15  gram). 
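
The arithmetic of Example 3 can be checked with a short Python sketch. It uses only the values stated in the example and the standard library; the variable names are chosen here for illustration. The last quantity confirms that, when the true mean is 8.0, the two tails together carry probability 0.10.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 8.0, 0.15, 5, 0.10    # values from Example 3

se = sigma / sqrt(n)                          # standard deviation of the sample mean
z = NormalDist().inv_cdf(1 - alpha / 2)       # z_{0.05}, about 1.645
c_low, c_high = mu0 - z * se, mu0 + z * se    # critical values C and C'

# Probability, when H0 is true, that the sample mean lands in the rejection region
xbar_dist = NormalDist(mu0, se)
p_reject = xbar_dist.cdf(c_low) + (1 - xbar_dist.cdf(c_high))

print(round(se, 3), round(c_low, 2), round(c_high, 2), round(p_reject, 2))
# 0.067 7.89 8.11 0.1
```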


Because  the  rejection  regions  are  computed  based  on  areas  in  tails  of  distributions, 
as  shown  in  Figure  8.2,  hypothesis  tests  are  classified  according  to  the  form  of  the 
alternative  hypothesis  in  the  following  way. 


Definition 

If \(H_a\) has the form \(\mu \neq \mu_0\) the test is called a two-tailed test.
If \(H_a\) has the form \(\mu < \mu_0\) the test is called a left-tailed test.
If \(H_a\) has the form \(\mu > \mu_0\) the test is called a right-tailed test.
Each of the last two forms is also called a one-tailed test.


Two  Types  of  Errors 

The  format  of  the  testing  procedure  in  general  terms  is  to  take  a  sample  and  use  the 
information  it  contains  to  come  to  a  decision  about  the  two  hypotheses.  As  stated 
before  our  decision  will  always  be  either 

1. reject the null hypothesis \(H_0\) in favor of the alternative \(H_a\) presented,
or

2. do not reject the null hypothesis \(H_0\) in favor of the alternative \(H_a\)
presented.

There are four possible outcomes of a hypothesis testing procedure, as shown in
the following table:


| Our Decision | True State of Nature: \(H_0\) is true | True State of Nature: \(H_0\) is false |
|---|---|---|
| Do not reject \(H_0\) | Correct decision | Type II error |
| Reject \(H_0\) | Type I error | Correct decision |

As the table shows, there are two ways to be right and two ways to be wrong.
Typically to reject \(H_0\) when it is actually true is a more serious error than to
fail to reject it when it is false, so the former error is labeled “Type I” and
the latter error “Type II.”


Definition 

In a test of hypotheses, a Type I error⁷ is the decision to reject \(H_0\) when it is
in fact true. A Type II error⁸ is the decision not to reject \(H_0\) when it is in
fact not true.


Unless we perform a census we do not have certain knowledge, so we do not know
whether our decision matches the true state of nature or if we have made an error.
We reject \(H_0\) if what we observe would be a “rare” event if \(H_0\) were true.
But rare events are not impossible: they occur with probability \(\alpha\). Thus
when \(H_0\) is true, a rare event will be observed in the proportion \(\alpha\) of
repeated similar tests, and \(H_0\) will be erroneously rejected in those tests.
Thus \(\alpha\) is the probability that in following the testing procedure to decide
between \(H_0\) and \(H_a\) we will make a Type I error.
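
The claim that a true null hypothesis is rejected in about the proportion \(\alpha\) of repeated tests can be illustrated by simulation. The sketch below is not part of the text's method; it assumes a normal population and reuses the numbers of Example 3 (mean 8.0, standard deviation 0.15, samples of size 5, \(\alpha = 0.10\)). The function name, the number of trials and the random seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_one_error_rate(mu0, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated samples, drawn with H0 true, that lead to
    rejecting H0 in a two-tailed test."""
    rng = random.Random(seed)
    se = sigma / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    c_low, c_high = mu0 - z * se, mu0 + z * se
    rejections = 0
    for _ in range(trials):
        # draw one sample from a population in which H0 really is true
        xbar = fmean(rng.gauss(mu0, sigma) for _ in range(n))
        if xbar <= c_low or xbar >= c_high:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate(mu0=8.0, sigma=0.15, n=5, alpha=0.10, trials=20_000)
print(rate)   # should be close to 0.10
```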


7.  Rejection  of  a  true  null 
hypothesis. 

8.  Failure  to  reject  a  false  null 
hypothesis. 

9. The probability \(\alpha\) that defines
an event as “rare”; the
probability that the test
procedure will lead to a Type I
error.


Definition 

The number \(\alpha\) that is used to determine the rejection region is called the
level of significance of the test⁹. It is the probability that the test procedure
will result in a Type I error.
The probability of making a Type II error is too complicated to discuss in a
beginning text, so we will say no more about it than this: for a fixed sample size,
choosing \(\alpha\) smaller in order to reduce the chance of making a Type I error
has the effect of increasing the chance of making a Type II error. The only way to
simultaneously reduce the chances of making either kind of error is to increase the
sample size.
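
For readers who want to see this trade-off in numbers, here is a rough sketch that goes slightly beyond the text. It computes the probability of a Type II error for a two-tailed test in the setting of Example 3, under the added assumption that the true mean is 8.1 grams. That value, like the function name, is hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def type_two_error_prob(mu0, true_mu, sigma, n, alpha):
    """P(fail to reject H0: mu = mu0) in a two-tailed test when the
    population mean is really true_mu."""
    se = sigma / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    c_low, c_high = mu0 - z * se, mu0 + z * se
    xbar_dist = NormalDist(true_mu, se)     # sample mean when the mean is true_mu
    # H0 is not rejected when the sample mean falls between C and C'
    return xbar_dist.cdf(c_high) - xbar_dist.cdf(c_low)


# true_mu = 8.1 is a hypothetical value, not taken from the text.
for n, alpha in [(5, 0.10), (5, 0.01), (20, 0.01)]:
    beta = type_two_error_prob(mu0=8.0, true_mu=8.1, sigma=0.15, n=n, alpha=alpha)
    print(n, alpha, round(beta, 3))
```

Under these assumptions, lowering \(\alpha\) from 0.10 to 0.01 with the sample size held at 5 raises the Type II error probability, while raising the sample size to 20 brings it back below its original level even at \(\alpha = 0.01\).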

Standardizing  the  Test  Statistic 

Hypothesis testing will be considered in a number of contexts, and great
unification as well as simplification results when the relevant sample statistic is
standardized by subtracting its mean from it and then dividing by its standard
deviation. The resulting statistic is called a standardized test statistic. In every
situation treated in this and the following two chapters the standardized test
statistic will have either the standard normal distribution or Student’s
t-distribution.


Definition 

A standardized test statistic¹⁰ for a hypothesis test is the statistic that is formed
by subtracting from the statistic of interest its mean and dividing by its standard
deviation.


For example, reviewing Note 8.14 "Example 3", if instead of working with the
sample mean \(\bar{X}\) we work with the test statistic

\[ Z = \frac{\bar{X} - 8.0}{0.067} \]

then the distribution involved is standard normal and the critical values are just
\(\pm z_{0.05}\). The extra work that was done to find that \(C = 7.89\) and
\(C' = 8.11\) is eliminated. In every hypothesis test in this book the standardized
test statistic will be governed by either the standard normal distribution or
Student’s t-distribution. Information about rejection regions is summarized in the
following tables:

When the test statistic has the standard normal distribution:

| Symbol in \(H_a\) | Terminology | Rejection Region |
|---|---|---|
| < | Left-tailed test | \((-\infty, -z_{\alpha}]\) |
| > | Right-tailed test | \([z_{\alpha}, \infty)\) |
| ≠ | Two-tailed test | \((-\infty, -z_{\alpha/2}] \cup [z_{\alpha/2}, \infty)\) |

When the test statistic has Student’s t-distribution:

| Symbol in \(H_a\) | Terminology | Rejection Region |
|---|---|---|
| < | Left-tailed test | \((-\infty, -t_{\alpha}]\) |
| > | Right-tailed test | \([t_{\alpha}, \infty)\) |
| ≠ | Two-tailed test | \((-\infty, -t_{\alpha/2}] \cup [t_{\alpha/2}, \infty)\) |

10. The standardized statistic used
in performing the test.

Every  instance  of  hypothesis  testing  discussed  in  this  and  the  following  two 
chapters  will  have  a  rejection  region  like  one  of  the  six  forms  tabulated  in  the  tables 
above. 


No  matter  what  the  context  a  test  of  hypotheses  can  always  be  performed  by 
applying  the  following  systematic  procedure,  which  will  be  illustrated  in  the 
examples  in  the  succeeding  sections. 


Systematic  Hypothesis  Testing  Procedure:  Critical  Value 
Approach 

1.  Identify  the  null  and  alternative  hypotheses. 

2.  Identify  the  relevant  test  statistic  and  its  distribution. 

3.  Compute  from  the  data  the  value  of  the  test  statistic. 

4.  Construct  the  rejection  region. 

5.  Compare  the  value  computed  in  Step  3  to  the  rejection  region 
constructed  in  Step  4  and  make  a  decision.  Formulate  the  decision 
in  the  context  of  the  problem,  if  applicable. 
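
As a rough sketch, Steps 3 through 5 of the procedure can be written as a single Python function for the case in which the standardized test statistic has the standard normal distribution. The function name `z_test`, its parameters and the data in the usage lines are hypothetical; Steps 1 and 2 (choosing the hypotheses and the statistic) remain the analyst's job and enter only through the arguments.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sd, n, alpha, alternative):
    """Critical value approach for H0: mu = mu0 with a standard normal statistic.

    sd is the population standard deviation if known, otherwise the sample
    standard deviation (large samples). alternative is "<", ">" or "!=",
    the symbol in Ha. Returns (z, reject).
    """
    z = (xbar - mu0) / (sd / sqrt(n))            # Step 3: value of the test statistic
    std_normal = NormalDist()
    if alternative == "<":                       # Step 4: rejection region
        reject = z <= -std_normal.inv_cdf(1 - alpha)
    elif alternative == ">":
        reject = z >= std_normal.inv_cdf(1 - alpha)
    elif alternative == "!=":
        reject = abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be '<', '>' or '!='")
    return z, reject                             # Step 5: the decision


# Hypothetical data, not taken from the text.
z, reject = z_test(xbar=73.2, mu0=75, sd=6.0, n=40, alpha=0.05, alternative="<")
print(round(z, 3), reject)
```

The decision still has to be stated in the context of the problem, which no function can do on its own.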


The  procedure  that  we  have  outlined  in  this  section  is  called  the  “Critical  Value 
Approach”  to  hypothesis  testing  to  distinguish  it  from  an  alternative  but  equivalent 
approach  that  will  be  introduced  at  the  end  of  Section  8.3  "The  Observed 
Significance  of  a  Test". 
KEY TAKEAWAYS


• A test of hypotheses is a statistical process for deciding between two
competing assertions about a population parameter.

• The testing procedure is formalized in a five-step procedure.


EXERCISES


1. State the null and alternative hypotheses for each of the following situations.
(That is, identify the correct number \(\mu_0\) and write \(H_0: \mu = \mu_0\) and the
appropriate analogous expression for \(H_a\).)

a.  The  average  July  temperature  in  a  region  historically  has  been  74.5°F. 
Perhaps  it  is  higher  now. 

b.  The  average  weight  of  a  female  airline  passenger  with  luggage  was  145 
pounds  ten  years  ago.  The  FAA  believes  it  to  be  higher  now. 

c.  The  average  stipend  for  doctoral  students  in  a  particular  discipline  at  a 
state  university  is  $14,756.  The  department  chairman  believes  that  the 
national  average  is  higher. 

d.  The  average  room  rate  in  hotels  in  a  certain  region  is  $82.53.  A  travel  agent 
believes  that  the  average  in  a  particular  resort  area  is  different. 

e.  The  average  farm  size  in  a  predominately  rural  state  was  69.4  acres.  The 
secretary  of  agriculture  of  that  state  asserts  that  it  is  less  today. 

2. State the null and alternative hypotheses for each of the following situations.
(That is, identify the correct number \(\mu_0\) and write \(H_0: \mu = \mu_0\) and the
appropriate analogous expression for \(H_a\).)

a.  The  average  time  workers  spent  commuting  to  work  in  Verona  five  years 
ago  was  38.2  minutes.  The  Verona  Chamber  of  Commerce  asserts  that  the 
average  is  less  now. 

b.  The  mean  salary  for  all  men  in  a  certain  profession  is  $58,291.  A  special 
interest  group  thinks  that  the  mean  salary  for  women  in  the  same 
profession  is  different. 

c.  The  accepted  figure  for  the  caffeine  content  of  an  8-ounce  cup  of  coffee  is 
133 mg. A dietitian believes that the average for coffee served in a local
restaurant is higher.

d.  The  average  yield  per  acre  for  all  types  of  corn  in  a  recent  year  was  161.9 
bushels.  An  economist  believes  that  the  average  yield  per  acre  is  different 
this  year. 

e.  An  industry  association  asserts  that  the  average  age  of  all  self-described  fly 
fishermen  is  42.8  years.  A  sociologist  suspects  that  it  is  higher. 

3.  Describe  the  two  types  of  errors  that  can  be  made  in  a  test  of  hypotheses. 

4. Under what circumstance is a test of hypotheses certain to yield a correct
decision?
ANSWERS


1. a. \(H_0: \mu = 74.5\) vs. \(H_a: \mu > 74.5\)

b. \(H_0: \mu = 145\) vs. \(H_a: \mu > 145\)

c. \(H_0: \mu = 14756\) vs. \(H_a: \mu > 14756\)

d. \(H_0: \mu = 82.53\) vs. \(H_a: \mu \neq 82.53\)

e. \(H_0: \mu = 69.4\) vs. \(H_a: \mu < 69.4\)

3. A Type I error is made when a true \(H_0\) is rejected. A Type II error is made
when a false \(H_0\) is not rejected.
8.2 Large Sample Tests for a Population Mean


LEARNING  OBJECTIVES 

1.  To  learn  how  to  apply  the  five-step  test  procedure  for  a  test  of 
hypotheses  concerning  a  population  mean  when  the  sample  size  is  large. 

2.  To  learn  how  to  interpret  the  result  of  a  test  of  hypotheses  in  the 
context  of  the  original  narrated  situation. 


In this section we describe and demonstrate the procedure for conducting a test of
hypotheses about the mean of a population in the case that the sample size \(n\) is
at least 30. The Central Limit Theorem states that \(\bar{X}\) is approximately
normally distributed, and has mean \(\mu_{\bar{X}} = \mu\) and standard deviation
\(\sigma_{\bar{X}} = \sigma/\sqrt{n}\), where \(\mu\) and \(\sigma\) are the mean and
the standard deviation of the population. This implies that the statistic

\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]

has approximately the standard normal distribution, which means that probabilities
related to it are given in Figure 12.2 "Cumulative Normal Probability" and the last
line in Figure 12.3 "Critical Values of".


If we know \(\sigma\) then the statistic in the display is our test statistic. If,
as is typically the case, we do not know \(\sigma\), then we replace it by the sample
standard deviation \(s\). Since the sample is large the resulting test statistic
still has a distribution that is approximately standard normal.
Standardized Test Statistics for Large Sample Hypothesis
Tests Concerning a Single Population Mean


If \(\sigma\) is known:

\[ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \]

If \(\sigma\) is unknown:

\[ Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]

The test statistic has the standard normal distribution.


The  distribution  of  the  standardized  test  statistic  and  the  corresponding  rejection 
region  for  each  form  of  the  alternative  hypothesis  (left-tailed,  right-tailed,  or  two- 
tailed),  is  shown  in  Figure  8.4  "Distribution  of  the  Standardized  Test  Statistic  and 
the  Rejection  Region". 


Figure 8.4 Distribution of the Standardized Test Statistic and the Rejection Region
(panels for \(H_a: \mu < \mu_0\), \(H_a: \mu > \mu_0\), and \(H_a: \mu \neq \mu_0\))
EXAMPLE 4


It  is  hoped  that  a  newly  developed  pain  reliever  will  more  quickly  produce 
perceptible  reduction  in  pain  to  patients  after  minor  surgeries  than  a 
standard  pain  reliever.  The  standard  pain  reliever  is  known  to  bring  relief  in 
an  average  of  3.5  minutes  with  standard  deviation  2.1  minutes.  To  test 
whether  the  new  pain  reliever  works  more  quickly  than  the  standard  one,  50 
patients  with  minor  surgeries  were  given  the  new  pain  reliever  and  their 
times  to  relief  were  recorded.  The  experiment  yielded  sample  mean 
\(\bar{x} = 3.1\) minutes and sample standard deviation \(s = 1.5\) minutes. Is there
sufficient  evidence  in  the  sample  to  indicate,  at  the  5%  level  of  significance, 
that  the  newly  developed  pain  reliever  does  deliver  perceptible  relief  more 
quickly? 

Solution: 

We  perform  the  test  of  hypotheses  using  the  five-step  procedure  given  at 
the  end  of  Section  8.1  "The  Elements  of  Hypothesis  Testing". 


• Step 1. The natural assumption is that the new drug is no better
than the old one, but must be proved to be better. Thus if \(\mu\)
denotes the average time until all patients who are given the new
drug experience pain relief, the hypothesis test is

\[ H_0: \mu = 3.5 \quad \text{vs.} \quad H_a: \mu < 3.5 \quad @\ \alpha = 0.05 \]

• Step 2. The sample is large, but the population standard
deviation is unknown (the 2.1 minutes pertains to the old drug,
not the new one). Thus the test statistic is

\[ Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]

and has the standard normal distribution.

• Step 3. Inserting the data into the formula for the test statistic
gives

\[ Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{3.1 - 3.5}{1.5/\sqrt{50}} = -1.886 \]

• Step 4. Since the symbol in \(H_a\) is “<” this is a left-tailed test, so there
is a single critical value, \(-z_{\alpha} = -z_{0.05}\), which from the last line in
Figure 12.3 "Critical Values of" we read off as -1.645. The rejection region is
\((-\infty, -1.645]\).

• Step 5. As shown in Figure 8.5 "Rejection Region and Test
Statistic for Note 8.27 "Example 4"" the test statistic falls in the rejection
region. The decision is to reject \(H_0\). In the context of the problem our
conclusion is:

The data provide sufficient evidence, at the 5% level of
significance, to conclude that the average time until patients
experience perceptible relief from pain using the new pain
reliever is smaller than the average time for the standard pain
reliever.
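
The computation in Steps 3 through 5 of Example 4 can be reproduced with a few lines of Python. This is only a numerical check using the data stated in the example; the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, s, n, alpha = 3.1, 3.5, 1.5, 50, 0.05   # data from Example 4

z = (xbar - mu0) / (s / sqrt(n))                   # standardized test statistic
critical = NormalDist().inv_cdf(alpha)             # -z_{0.05} for a left-tailed test
reject = z <= critical                             # is z in (-inf, -1.645]?

print(round(z, 3), round(critical, 3), reject)     # -1.886 -1.645 True
```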

Figure 8.5 Rejection Region and Test Statistic for Note 8.27 "Example 4"
(\(H_a: \mu < 3.5\), \(Z = -1.886\))
EXAMPLE 5


A cosmetics company fills its best-selling 8-ounce jars of facial cream by an
automatic dispensing machine. The machine is set to dispense a mean of 8.1
ounces per jar. Uncontrollable factors in the process can shift the mean
away from 8.1 and cause either underfill or overfill, both of which are
undesirable. In such a case the dispensing machine is stopped and
recalibrated. Regardless of the mean amount dispensed, the standard
deviation of the amount dispensed always has value 0.22 ounce. A quality
control engineer routinely selects 30 jars from the assembly line to check
the amounts filled. On one occasion, the sample mean is \(\bar{x} = 8.2\) ounces
and the sample standard deviation is \(s = 0.25\) ounce. Determine if there is
sufficient evidence in the sample to indicate, at the 1% level of significance,
that the machine should be recalibrated.


Solution: 


•  Step 1. The natural assumption is that the machine is working
properly. Thus if $\mu$ denotes the mean amount of facial cream
being dispensed, the hypothesis test is


$H_0: \mu = 8.1$

vs. $H_a: \mu \neq 8.1$ @ $\alpha = 0.01$


•  Step 2. The sample is large and the population standard deviation
is known. Thus the test statistic is

$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

and has the standard normal distribution.


•  Step 3. Inserting the data into the formula for the test statistic
gives

$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{8.2 - 8.1}{0.22 / \sqrt{30}} = 2.490$
•  Step 4. Since the symbol in $H_a$ is “≠” this is a two-tailed test, so there are
two critical values, $\pm z_{\alpha/2} = \pm z_{0.005}$, which from the last line in
Figure 12.3 "Critical Values of " we read off as ±2.576. The rejection
region is $(-\infty, -2.576] \cup [2.576, \infty)$.

•  Step 5. As shown in Figure 8.6 "Rejection Region and Test
Statistic for Note 8.28 "Example 5"" the test statistic does not fall in
the rejection region. The decision is not to reject $H_0$. In the context
of the problem our conclusion is:

The data do not provide sufficient evidence, at the 1% level of
significance, to conclude that the average amount of product
dispensed is different from 8.1 ounces. We conclude that the
machine does not need to be recalibrated.

Figure 8.6

Rejection Region and Test Statistic for Note 8.28 "Example 5"

(Figure label: $H_a: \mu \neq 8.1$)


KEY  TAKEAWAYS 


•  There  are  two  formulas  for  the  test  statistic  in  testing  hypotheses  about 
a  population  mean  with  large  samples.  Both  test  statistics  follow  the 
standard  normal  distribution. 

•  The  population  standard  deviation  is  used  if  it  is  known,  otherwise  the 
sample  standard  deviation  is  used. 

•  The  same  five-step  procedure  is  used  with  either  test  statistic. 
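The five-step critical value procedure of this section can be sketched in Python. This is only an illustration under stated assumptions: the function name `z_test_critical` and its parameters are invented for this sketch, and `statistics.NormalDist` from the Python standard library stands in for the table lookup of critical values. The data are those of Note 8.28 "Example 5".

```python
from math import sqrt
from statistics import NormalDist


def z_test_critical(xbar, mu0, sd, n, alpha, tail):
    """Large-sample z test by the critical value approach (illustrative).

    sd is the population standard deviation if it is known, otherwise
    the sample standard deviation. tail is "left", "right", or "two".
    Returns (z, rejection_region_description, reject).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = (xbar - mu0) / (sd / sqrt(n))

    if tail == "left":
        crit = std_normal.inv_cdf(alpha)          # -z_alpha
        reject = z <= crit
        region = f"(-inf, {crit:.3f}]"
    elif tail == "right":
        crit = std_normal.inv_cdf(1 - alpha)      # z_alpha
        reject = z >= crit
        region = f"[{crit:.3f}, inf)"
    elif tail == "two":
        crit = std_normal.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
        reject = abs(z) >= crit
        region = f"(-inf, {-crit:.3f}] U [{crit:.3f}, inf)"
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, region, reject


# Example 5: H0: mu = 8.1 vs Ha: mu != 8.1 at alpha = 0.01,
# with sigma = 0.22 known, n = 30, and sample mean 8.2.
z, region, reject = z_test_critical(8.2, 8.1, 0.22, 30, 0.01, "two")
print(f"z = {z:.3f}, rejection region {region}, reject H0: {reject}")
```

For these data the computed statistic is about 2.490 and the critical values are about ±2.576, so the sketch agrees with the decision in Example 5 not to reject the null hypothesis.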
1.  Find the rejection region (for the standardized test statistic) for each
hypothesis test.

a.  $H_0: \mu = 27$ vs. $H_a: \mu < 27$ @ $\alpha = 0.05$.

b.  $H_0: \mu = 52$ vs. $H_a: \mu \neq 52$ @ $\alpha = 0.05$.

c.  $H_0: \mu = -105$ vs. $H_a: \mu > -105$ @ $\alpha = 0.10$.

d.  $H_0: \mu = 78.8$ vs. $H_a: \mu \neq 78.8$ @ $\alpha = 0.10$.

2.  Find the rejection region (for the standardized test statistic) for each
hypothesis test.

a.  $H_0: \mu = 17$ vs. $H_a: \mu < 17$ @ $\alpha = 0.01$.

b.  $H_0: \mu = 880$ vs. $H_a: \mu \neq 880$ @ $\alpha = 0.01$.

c.  $H_0: \mu = -12$ vs. $H_a: \mu > -12$ @ $\alpha = 0.05$.

d.  $H_0: \mu = 21.1$ vs. $H_a: \mu \neq 21.1$ @ $\alpha = 0.05$.

3.  Find the rejection region (for the standardized test statistic) for each
hypothesis test. Identify the test as left-tailed, right-tailed, or two-tailed.

a.  $H_0: \mu = 141$ vs. $H_a: \mu < 141$ @ $\alpha = 0.20$.

b.  $H_0: \mu = -54$ vs. $H_a: \mu < -54$ @ $\alpha = 0.05$.

c.  $H_0: \mu = 98.6$ vs. $H_a: \mu \neq 98.6$ @ $\alpha = 0.05$.

d.  $H_0: \mu = 3.8$ vs. $H_a: \mu > 3.8$ @ $\alpha = 0.001$.

4.  Find the rejection region (for the standardized test statistic) for each
hypothesis test. Identify the test as left-tailed, right-tailed, or two-tailed.

a.  $H_0: \mu = -62$ vs. $H_a: \mu \neq -62$ @ $\alpha = 0.005$.

b.  $H_0: \mu = 73$ vs. $H_a: \mu > 73$ @ $\alpha = 0.001$.

c.  $H_0: \mu = 1124$ vs. $H_a: \mu < 1124$ @ $\alpha = 0.001$.

d.  $H_0: \mu = 0.12$ vs. $H_a: \mu \neq 0.12$ @ $\alpha = 0.001$.

5.  Compute the value of the test statistic for the indicated test, based on the
information given.

a.  Testing $H_0: \mu = 72.2$ vs. $H_a: \mu > 72.2$, $\sigma$ unknown, $n = 55$,
$\bar{x} = 75.1$, $s = 9.25$

b.  Testing $H_0: \mu = 58$ vs. $H_a: \mu > 58$, $\sigma = 1.22$, $n = 40$,
$\bar{x} = 58.5$, $s = 1.29$

c.  Testing $H_0: \mu = -19.5$ vs. $H_a: \mu < -19.5$, $\sigma$ unknown, $n = 30$,
$\bar{x} = -23.2$, $s = 9.55$

d.  Testing $H_0: \mu = 805$ vs. $H_a: \mu \neq 805$, $\sigma = 37.5$, $n = 75$,
$\bar{x} = 818$, $s = 36.2$

6.  Compute the value of the test statistic for the indicated test, based on the
information given.

a.  Testing $H_0: \mu = 342$ vs. $H_a: \mu < 342$, $\sigma = 11.2$, $n = 40$,
$\bar{x} = 339$, $s = 10.3$

b.  Testing $H_0: \mu = 105$ vs. $H_a: \mu > 105$, $\sigma = 5.3$, $n = 80$,
$\bar{x} = 107$, $s = 5.1$

c.  Testing $H_0: \mu = -13.5$ vs. $H_a: \mu \neq -13.5$, $\sigma$ unknown, $n = 32$,
$\bar{x} = -13.8$, $s = 1.5$

d.  Testing $H_0: \mu = 28$ vs. $H_a: \mu \neq 28$, $\sigma$ unknown, $n = 68$,
$\bar{x} = 27.8$, $s = 1.3$

7.  Perform the indicated test of hypotheses, based on the information given.

a.  Test $H_0: \mu = 212$ vs. $H_a: \mu < 212$ @ $\alpha = 0.10$, $\sigma$ unknown,
$n = 36$, $\bar{x} = 211.2$, $s = 2.2$

b.  Test $H_0: \mu = -18$ vs. $H_a: \mu > -18$ @ $\alpha = 0.05$, $\sigma = 3.3$,
$n = 44$, $\bar{x} = -17.2$, $s = 3.1$

c.  Test $H_0: \mu = 24$ vs. $H_a: \mu \neq 24$ @ $\alpha = 0.02$, $\sigma$ unknown,
$n = 50$, $\bar{x} = 22.8$, $s = 1.9$

8.  Perform the indicated test of hypotheses, based on the information given.

a.  Test $H_0: \mu = 105$ vs. $H_a: \mu > 105$ @ $\alpha = 0.05$, $\sigma$ unknown,
$n = 30$, $\bar{x} = 108$, $s = 7.2$

b.  Test $H_0: \mu = 21.6$ vs. $H_a: \mu < 21.6$ @ $\alpha = 0.01$, $\sigma$ unknown,
$n = 78$, $\bar{x} = 20.5$, $s = 3.9$

c.  Test $H_0: \mu = -375$ vs. $H_a: \mu \neq -375$ @ $\alpha = 0.01$, $\sigma = 18.5$,
$n = 31$, $\bar{x} = -388$, $s = 18.0$


APPLICATIONS 


9.  In  the  past  the  average  length  of  an  outgoing  telephone  call  from  a  business 
office  has  been  143  seconds.  A  manager  wishes  to  check  whether  that  average 
has  decreased  after  the  introduction  of  policy  changes.  A  sample  of  100 
telephone  calls  produced  a  mean  of  133  seconds,  with  a  standard  deviation  of 
35  seconds.  Perform  the  relevant  test  at  the  1%  level  of  significance. 

10.  The  government  of  an  impoverished  country  reports  the  mean  age  at  death 
among  those  who  have  survived  to  adulthood  as  66.2  years.  A  relief  agency 
examines  30  randomly  selected  deaths  and  obtains  a  mean  of  62.3  years  with 
standard deviation 8.1 years. Test whether the agency’s data support the
alternative  hypothesis,  at  the  1%  level  of  significance,  that  the  population 
mean  is  less  than  66.2. 

11.  The  average  household  size  in  a  certain  region  several  years  ago  was  3.14 
persons.  A  sociologist  wishes  to  test,  at  the  5%  level  of  significance,  whether  it 
is  different  now.  Perform  the  test  using  the  information  collected  by  the 
sociologist:  in  a  random  sample  of  75  households,  the  average  size  was  2.98 
persons,  with  sample  standard  deviation  0.82  person. 

12.  The recommended daily calorie intake for teenage girls is 2,200 calories/day. A
nutritionist at a state university believes the average daily caloric intake of
girls in that state to be lower. Test that hypothesis, at the 5% level of
significance, against the null hypothesis that the population average is 2,200
calories/day using the following sample data: n = 36, x̄ = 2,150, s = 203.

13.  An  automobile  manufacturer  recommends  oil  change  intervals  of  3,000  miles. 
To  compare  actual  intervals  to  the  recommendation,  the  company  randomly 
samples  records  of  50  oil  changes  at  service  facilities  and  obtains  sample  mean 
3,752  miles  with  sample  standard  deviation  638  miles.  Determine  whether  the 
data  provide  sufficient  evidence,  at  the  5%  level  of  significance,  that  the 
population  mean  interval  between  oil  changes  exceeds  3,000  miles. 

14.  A  medical  laboratory  claims  that  the  mean  turn-around  time  for  performance 
of  a  battery  of  tests  on  blood  samples  is  1.88  business  days.  The  manager  of  a 
large  medical  practice  believes  that  the  actual  mean  is  larger.  A  random 
sample  of  45  blood  samples  yielded  mean  2.09  and  sample  standard  deviation 
0.13  day.  Perform  the  relevant  test  at  the  10%  level  of  significance,  using  these 
data. 

15.  A  grocery  store  chain  has  as  one  standard  of  service  that  the  mean  time 
customers  wait  in  line  to  begin  checking  out  not  exceed  2  minutes.  To  verify 
the  performance  of  a  store  the  company  measures  the  waiting  time  in  30 
instances,  obtaining  mean  time  2.17  minutes  with  standard  deviation  0.46 
minute.  Use  these  data  to  test  the  null  hypothesis  that  the  mean  waiting  time 

is  2  minutes  versus  the  alternative  that  it  exceeds  2  minutes,  at  the  10%  level  of 
significance. 

16.  A  magazine  publisher  tells  potential  advertisers  that  the  mean  household 
income  of  its  regular  readership  is  $61,500.  An  advertising  agency  wishes  to 
test  this  claim  against  the  alternative  that  the  mean  is  smaller.  A  sample  of  40 
randomly  selected  regular  readers  yields  mean  income  $59,800  with  standard 
deviation  $5,850.  Perform  the  relevant  test  at  the  1%  level  of  significance. 

17.  Authors  of  a  computer  algebra  system  wish  to  compare  the  speed  of  a  new 
computational  algorithm  to  the  currently  implemented  algorithm.  They  apply 
the new algorithm to 50 standard problems; it averages 8.16 seconds with
standard  deviation  0.17  second.  The  current  algorithm  averages  8.21  seconds 
on  such  problems.  Test,  at  the  1%  level  of  significance,  the  alternative 
hypothesis  that  the  new  algorithm  has  a  lower  average  time  than  the  current 
algorithm. 

18.  A  random  sample  of  the  starting  salaries  of  35  randomly  selected  graduates 
with  bachelor’s  degrees  last  year  gave  sample  mean  and  standard  deviation 
$41,202  and  $7,621,  respectively.  Test  whether  the  data  provide  sufficient 
evidence,  at  the  5%  level  of  significance,  to  conclude  that  the  mean  starting 
salary  of  all  graduates  last  year  is  less  than  the  mean  of  all  graduates  two  years 
before,  $43,589. 


ADDITIONAL  EXERCISES 


19.  The  mean  household  income  in  a  region  served  by  a  chain  of  clothing  stores  is 
$48,750.  In  a  sample  of  40  customers  taken  at  various  stores  the  mean  income 
of  the  customers  was  $51,505  with  standard  deviation  $6,852. 

a.  Test at the 10% level of significance the null hypothesis that the mean
household income of customers of the chain is $48,750 against the
alternative that it is different from $48,750.

b.  The  sample  mean  is  greater  than  $48,750,  suggesting  that  the  actual  mean 
of  people  who  patronize  this  store  is  greater  than  $48,750.  Perform  this 
test,  also  at  the  10%  level  of  significance.  (The  computation  of  the  test 
statistic  done  in  part  (a)  still  applies  here.) 

20.  The labor charges for repairs at an automobile service center are based on a
standard time specified for each type of repair. The time specified for
replacement of a universal joint in a drive shaft is one hour. The manager
reviews a sample of 30 such repairs. The average of the actual repair times is
0.86 hour with standard deviation 0.32 hour.

a.  Test at the 1% level of significance the null hypothesis that the actual
mean time for this repair is one hour against the alternative that it
differs from one hour.

b.  The  sample  mean  is  less  than  one  hour,  suggesting  that  the  mean  actual 
time  for  this  repair  is  less  than  one  hour.  Perform  this  test,  also  at  the  1% 
level  of  significance.  (The  computation  of  the  test  statistic  done  in  part  (a) 
still  applies  here.) 
LARGE DATA SET EXERCISES


21.  Large Data Set 1 records the SAT scores of 1,000 students. Regarding it as a
random sample of all high school students, use it to test the hypothesis that
the population mean exceeds 1,510, at the 1% level of significance. (The null
hypothesis is that $\mu = 1510$.)

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls

22.  Large Data Set 1 records the GPAs of 1,000 college students. Regarding it as a
random sample of all college students, use it to test the hypothesis that the
population mean is less than 2.50, at the 10% level of significance. (The null
hypothesis is that $\mu = 2.50$.)

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls

23.  Large Data Set 1 lists the SAT scores of 1,000 students.
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls

a.  Regard the data as arising from a census of all students at a high school, in
which the SAT score of every student was measured. Compute the
population mean $\mu$.

b.  Regard the first 50 students in the data set as a random sample drawn from
the population of part (a) and use it to test the hypothesis that the
population mean exceeds 1,510, at the 10% level of significance. (The null
hypothesis is that $\mu = 1510$.)

c.  Is your conclusion in part (b) in agreement with the true state of nature
(which by part (a) you know), or is your decision in error? If your decision
is in error, is it a Type I error or a Type II error?

24.  Large Data Set 1 lists the GPAs of 1,000 students.
http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls

a.  Regard the data as arising from a census of all freshmen at a small college
at the end of their first academic year of college study, in which the GPA of
every such person was measured. Compute the population mean $\mu$.

b.  Regard the first 50 students in the data set as a random sample drawn from
the population of part (a) and use it to test the hypothesis that the
population mean is less than 2.50, at the 10% level of significance. (The null
hypothesis is that $\mu = 2.50$.)

c.  Is your conclusion in part (b) in agreement with the true state of nature
(which by part (a) you know), or is your decision in error? If your decision
is in error, is it a Type I error or a Type II error?
ANSWERS


1.  a.  $Z \le -1.645$

b.  $Z \le -1.96$ or $Z \ge 1.96$

c.  $Z \ge 1.28$

d.  $Z \le -1.645$ or $Z \ge 1.645$

3.  a.  $Z \le -0.84$

b.  $Z \le -1.645$

c.  $Z \le -1.96$ or $Z \ge 1.96$

d.  $Z \ge 3.1$

5.  a.  $Z = 2.325$

b.  $Z = 2.592$

c.  $Z = -2.122$

d.  $Z = 3.002$

7.  a.  $Z = -2.18$, $-z_{0.10} = -1.28$, reject $H_0$.

b.  $Z = 1.61$, $z_{0.05} = 1.645$, do not reject $H_0$.

c.  $Z = -4.47$, $-z_{0.01} = -2.33$, reject $H_0$.

9.  $Z = -2.86$, $-z_{0.01} = -2.33$, reject $H_0$.

11.  $Z = -1.69$, $-z_{0.025} = -1.96$, do not reject $H_0$.

13.  $Z = 8.33$, $z_{0.05} = 1.645$, reject $H_0$.

15.  $Z = 2.02$, $z_{0.10} = 1.28$, reject $H_0$.

17.  $Z = -2.08$, $-z_{0.01} = -2.33$, do not reject $H_0$.

19.  a.  $Z = 2.54$, $z_{0.05} = 1.645$, reject $H_0$;
b.  $Z = 2.54$, $z_{0.10} = 1.28$, reject $H_0$.

21.  $H_0: \mu = 1510$ vs. $H_a: \mu > 1510$. Test Statistic: $Z = 2.7882$. Rejection
Region: $[2.33, \infty)$. Decision: Reject $H_0$.

23.  a.  $\mu_0 = 1528.74$

b.  $H_0: \mu = 1510$ vs. $H_a: \mu > 1510$. Test Statistic: $Z = -1.41$.
Rejection Region: $[1.28, \infty)$. Decision: Fail to reject $H_0$.

c.  No, it is a Type II error.


8.3 The Observed Significance of a Test


LEARNING  OBJECTIVES 

1.  To  learn  what  the  observed  significance  of  a  test  is. 

2.  To  learn  how  to  compute  the  observed  significance  of  a  test. 

3.  To  learn  how  to  apply  the  p-value  approach  to  hypothesis  testing. 


The  Observed  Significance 

The conceptual basis of our testing procedure is that we reject $H_0$ only if the data
that we obtained would constitute a rare event if $H_0$ were actually true. The level of
significance $\alpha$ specifies what is meant by “rare.” The observed significance of the test
is a measure of how rare the value of the test statistic that we have just observed
would be if the null hypothesis were true. That is, the observed significance of the test
just performed is the probability that, if the test were repeated with a new sample,
the result of the new test would be at least as contrary to $H_0$ and in support of $H_a$ as
what was observed in the original test.


Definition 

The observed significance or p-value of a specific test of hypotheses is the
probability, on the supposition that $H_0$ is true, of obtaining a result at least as contrary
to $H_0$ and in favor of $H_a$ as the result actually observed in the sample data.


Think back to Note 8.27 "Example 4" in Section 8.2 "Large Sample Tests for a
Population Mean" concerning the effectiveness of a new pain reliever. This was a
left-tailed test in which the value of the test statistic was -1.886. To be as contrary
to $H_0$ and in support of $H_a$ as the result $Z = -1.886$ actually observed means to
obtain a value of the test statistic in the interval $(-\infty, -1.886]$. Rounding -1.886
to -1.89, we can read directly from Figure 12.2 "Cumulative Normal Probability"
that $P(Z \le -1.89) = 0.0294$. Thus the p-value or observed significance of the
test in Note 8.27 "Example 4" is 0.0294 or about 3%. Under repeated sampling from
this population, if $H_0$ were true then only about 3% of all samples of size 50 would
give a result as contrary to $H_0$ and in favor of $H_a$ as the sample we observed. Note
that the probability 0.0294 is the area of the left tail cut off by the test statistic in
this left-tailed test.


Analogous reasoning applies to a right-tailed or a two-tailed test. In a two-tailed
test, however, a value of the test statistic that is as far from 0 as the observed value
but on the opposite side of 0 is just as contrary to $H_0$ as a value the same distance
away on the same side of 0. Hence the corresponding tail area is doubled.


Computational  Definition  of  the  Observed  Significance  of 
a  Test  of  Hypotheses 

The  observed  significance  of  a  test  of  hypotheses  is  the  area  of  the  tail  of  the 
distribution  cut  off  by  the  test  statistic  (times  two  in  the  case  of  a  two-tailed 
test). 
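The computational definition translates directly into a short calculation. The sketch below is illustrative only: the function name `observed_significance` and the `tail` argument are assumptions made for this example, and `statistics.NormalDist` from the Python standard library replaces the lookup in the cumulative normal table.

```python
from statistics import NormalDist


def observed_significance(z, tail):
    """Return the p-value for a standard normal test statistic z.

    tail is "left", "right", or "two" (hypothetical labels for the
    three forms of the alternative hypothesis).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.cdf(z)                 # area to the left of z
    if tail == "right":
        return 1 - std_normal.cdf(z)             # area to the right of z
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))  # twice the area beyond |z|
    raise ValueError("tail must be 'left', 'right', or 'two'")


# Left-tailed test with z rounded to -1.89, as in the pain reliever example.
print(round(observed_significance(-1.89, "left"), 4))
```

Because the software does not round the test statistic to two decimal places before the lookup, values computed this way can differ slightly in the last digit from values read off the table.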


EXAMPLE  6 


Compute  the  observed  significance  of  the  test  performed  in  Note  8.28 
"Example  5"  in  Section  8.2  "Large  Sample  Tests  for  a  Population  Mean". 

Solution: 

The value of the test statistic was $Z = 2.490$, which by Figure 12.2 "Cumulative
Normal Probability" cuts off a tail of area 0.0064, as shown in Figure 8.7
"Area of the Tail for Note 8.34 "Example 6"". Since the test was two-tailed, the
observed significance is $2 \times 0.0064 = 0.0128$.


Figure 8.7

Area of the Tail for Note 8.34 "Example 6"

(Figure labels: $H_a: \mu \neq 8.1$; area = 0.0064)
The p-value Approach to Hypothesis Testing

In Note 8.27 "Example 4" in Section 8.2 "Large Sample Tests for a Population Mean"
the test was performed at the 5% level of significance: the definition of “rare” event
was probability $\alpha = 0.05$ or less. We saw above that the observed significance of
the test was $p = 0.0294$ or about 3%. Since $p = 0.0294 < 0.05 = \alpha$ (or 3% is less
than 5%), the decision turned out to be to reject: what was observed was sufficiently
unlikely to qualify as an event so rare as to be regarded as (practically)
incompatible with $H_0$.


In Note 8.28 "Example 5" in Section 8.2 "Large Sample Tests for a Population Mean"
the test was performed at the 1% level of significance: the definition of “rare” event
was probability $\alpha = 0.01$ or less. The observed significance of the test was
computed in Note 8.34 "Example 6" as $p = 0.0128$ or about 1.3%. Since
$p = 0.0128 > 0.01 = \alpha$ (or 1.3% is greater than 1%), the decision turned out to be
not to reject. The event observed was unlikely, but not sufficiently unlikely to lead
to rejection of the null hypothesis.


The reasoning just presented is the basis for a slightly different but equivalent
formulation of the hypothesis testing process. The first three steps are the same as
before, but instead of using $\alpha$ to compute critical values and construct a rejection
region, one computes the p-value $p$ of the test and compares it to $\alpha$, rejecting $H_0$ if
$p \le \alpha$ and not rejecting if $p > \alpha$.


Systematic  Hypothesis  Testing  Procedure:  p-Value 
Approach 

1.  Identify  the  null  and  alternative  hypotheses. 

2.  Identify  the  relevant  test  statistic  and  its  distribution. 

3.  Compute  from  the  data  the  value  of  the  test  statistic. 

4.  Compute  the  p-value  of  the  test. 

5.  Compare the value computed in Step 4 to significance level $\alpha$ and
make a decision: reject $H_0$ if $p \le \alpha$ and do not reject $H_0$ if $p > \alpha$.
Formulate the decision in the context of the problem, if applicable.
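Steps 3 through 5 of the p-value approach can be sketched in Python as follows. This is a hedged illustration rather than a prescribed implementation: the function name `z_test_p_value` and its parameters are assumptions introduced here, and the summary statistics are those of the basketball example (Note 8.36 "Example 7") in this section.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(xbar, mu0, sd, n, alpha, tail):
    """Large-sample z test by the p-value approach (illustrative).

    sd is the population standard deviation if it is known, otherwise
    the sample standard deviation. tail is "left", "right", or "two".
    Returns (z, p, reject), where reject is True when p <= alpha.
    """
    std_normal = NormalDist()
    # Step 3: compute the value of the test statistic.
    z = (xbar - mu0) / (sd / sqrt(n))
    # Step 4: tail area cut off by z, doubled for a two-tailed test.
    if tail == "left":
        p = std_normal.cdf(z)
    elif tail == "right":
        p = 1 - std_normal.cdf(z)
    elif tail == "two":
        p = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    # Step 5: compare the p-value with the level of significance.
    return z, p, p <= alpha


# H0: mu = 202.5 vs Ha: mu < 202.5 at alpha = 0.05,
# with n = 85, sample mean 199.2, sample standard deviation 19.63.
z, p, reject = z_test_p_value(199.2, 202.5, 19.63, 85, 0.05, "left")
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {reject}")
```

Steps 1 and 2 (stating the hypotheses and choosing the test statistic) remain the analyst's responsibility; here they are encoded only through the arguments `mu0`, `tail`, and `sd`.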
EXAMPLE 7


The  total  score  in  a  professional  basketball  game  is  the  sum  of  the  scores  of 
the  two  teams.  An  expert  commentator  claims  that  the  average  total  score 
for  NBA  games  is  202.5.  A  fan  suspects  that  this  is  an  overstatement  and  that 
the  actual  average  is  less  than  202.5.  He  selects  a  random  sample  of  85  games 
and  obtains  a  mean  total  score  of  199.2  with  standard  deviation  19.63. 
Determine,  at  the  5%  level  of  significance,  whether  there  is  sufficient 
evidence  in  the  sample  to  reject  the  expert  commentator’s  claim. 

Solution: 

•  Step 1. Let $\mu$ be the true average total game score of all NBA
games. The relevant test is

$H_0: \mu = 202.5$

vs. $H_a: \mu < 202.5$ @ $\alpha = 0.05$

•  Step 2. The sample is large and the population standard deviation
is unknown. Thus the test statistic is

$Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

and has the standard normal distribution.

•  Step 3. Inserting the data into the formula for the test statistic
gives

$Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{199.2 - 202.5}{19.63 / \sqrt{85}} = -1.55$


•  Step 4. The area of the left tail cut off by $Z = -1.55$ is, by Figure 12.2
"Cumulative Normal Probability", 0.0606, as illustrated in Figure 8.8
"Test Statistic for Note 8.36 "Example 7"". Since the test is left-tailed, the
p-value is just this number, $p = 0.0606$.
•  Step 5. Since $p = 0.0606 > 0.05 = \alpha$, the decision is not to
reject $H_0$. In the context of the problem our conclusion is:

The  data  do  not  provide  sufficient  evidence,  at  the  5%  level  of 
significance,  to  conclude  that  the  average  total  score  of  NBA 
games  is  less  than  202.5. 

Figure 8.8

Test Statistic for Note 8.36 "Example 7"

(Figure label: $H_a: \mu < 202.5$)
EXAMPLE 8


Mr. Prospero has been teaching Algebra II from a particular textbook at
Remote Isle High School for many years. Over the years students in his
Algebra II classes have consistently scored an average of 67 on the end of
course exam (EOC). This year Mr. Prospero used a new textbook in the hope
that the average score on the EOC test would be higher. The EOC test
scores of the 64 students who took Algebra II from Mr. Prospero this year had
mean 69.4 and sample standard deviation 6.1. Determine whether these data
provide sufficient evidence, at the 1% level of significance, to conclude that
the average EOC test score is higher with the new textbook.


Solution: 


•  Step 1. Let $\mu$ be the true average score on the EOC exam of all Mr.
Prospero’s students who take the Algebra II course with the new
textbook. The natural statement that would be assumed true
unless there were strong evidence to the contrary is that the new
book is about the same as the old one. The alternative, which it
takes evidence to establish, is that the new book is better, which
corresponds to a higher value of $\mu$. Thus the relevant test is


$H_0: \mu = 67$

vs. $H_a: \mu > 67$ @ $\alpha = 0.01$


•  Step 2. The sample is large and the population standard deviation
is unknown. Thus the test statistic is

$Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

and has the standard normal distribution.

•  Step 3. Inserting the data into the formula for the test statistic
gives

$Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{69.4 - 67}{6.1 / \sqrt{64}} = 3.15$


•  Step 4. The area of the right tail cut off by $Z = 3.15$ is, by Figure 12.2
"Cumulative Normal Probability", $1 - 0.9992 = 0.0008$, as shown
in Figure 8.9 "Test Statistic for Note 8.37 "Example 8"". Since the test is
right-tailed, the p-value is just this number, $p = 0.0008$.

•  Step 5. Since $p = 0.0008 < 0.01 = \alpha$, the decision is to
reject $H_0$. In the context of the problem our conclusion is:

The  data  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  the  average  EOC  exam  score  of 
students  taking  the  Algebra  II  course  from  Mr.  Prospero  using 
the  new  book  is  higher  than  the  average  score  of  those  taking  the 
course  from  him  but  using  the  old  book. 

Figure 8.9

Test Statistic for Note 8.37 "Example 8"

(Figure label: $H_a: \mu > 67$)
EXAMPLE 9


For  the  surface  water  in  a  particular  lake,  local  environmental  scientists 
would  like  to  maintain  an  average  pH  level  at  7.4.  Water  samples  are 
routinely  collected  to  monitor  the  average  pH  level.  If  there  is  evidence  of  a 
shift  in  pH  value,  in  either  direction,  then  remedial  action  will  be  taken.  On  a 
particular  day  30  water  samples  are  taken  and  yield  average  pH  reading  of 
7.7  with  sample  standard  deviation  0.5.  Determine,  at  the  1%  level  of 
significance,  whether  there  is  sufficient  evidence  in  the  sample  to  indicate 
that  remedial  action  should  be  taken. 

Solution: 


•  Step 1. Let $\mu$ be the true average pH level at the time the samples
were taken. The relevant test is

$H_0: \mu = 7.4$

vs. $H_a: \mu \neq 7.4$ @ $\alpha = 0.01$

•  Step 2. The sample is large and the population standard deviation
is unknown. Thus the test statistic is

$Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

and has the standard normal distribution.

•  Step 3. Inserting the data into the formula for the test statistic
gives

$Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{7.7 - 7.4}{0.5 / \sqrt{30}} = 3.29$


•  Step 4. The area of the right tail cut off by $Z = 3.29$ is, by Figure 12.2
"Cumulative Normal Probability", $1 - 0.9995 = 0.0005$, as
illustrated in Figure 8.10 "Test Statistic for Note 8.38 "Example 9"". Since
the test is two-tailed, the p-value is the double of this number,

$p = 2 \times 0.0005 = 0.0010$.
•  Step 5. Since $p = 0.0010 < 0.01 = \alpha$, the decision is to
reject $H_0$. In the context of the problem our conclusion is:


The  data  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  the  average  pH  of  surface  water  in 
the  lake  is  different  from  7.4.  That  is,  remedial  action  is 
indicated. 

Figure 8.10

Test Statistic for Note 8.38 "Example 9"

(Figure label: $H_a: \mu \neq 7.4$)


KEY  TAKEAWAYS 


•  The observed significance or p-value of a test is a measure of how
inconsistent the sample result is with $H_0$ and in favor of $H_a$.

•  The p-value approach to hypothesis testing means that one merely
compares the p-value to $\alpha$ instead of constructing a rejection region.

•  There  is  a  systematic  five-step  procedure  for  the  p-value  approach  to 
hypothesis  testing. 
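The equivalence of the critical value approach and the p-value approach can be checked numerically. The sketch below, restricted to a right-tailed test for brevity, scans a grid of test statistic values and records any value for which the two approaches disagree. The function names `reject_by_region` and `reject_by_p_value` are assumptions made for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def reject_by_region(z, alpha):
    """Right-tailed test: reject when z lies in [z_alpha, infinity)."""
    return z >= std_normal.inv_cdf(1 - alpha)


def reject_by_p_value(z, alpha):
    """Right-tailed test: reject when the right-tail area is at most alpha."""
    return 1 - std_normal.cdf(z) <= alpha


alpha = 0.01
# Scan test statistics from -4.00 to 4.00 in steps of 0.01.
z_values = [k / 100 for k in range(-400, 401)]
disagreements = [
    z for z in z_values
    if reject_by_region(z, alpha) != reject_by_p_value(z, alpha)
]
print("disagreements:", disagreements)
```

On this grid the list of disagreements is empty, as the theory predicts. The same check could be repeated for left-tailed and two-tailed tests by changing the two decision rules accordingly.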
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1.  Compute  the  observed  significance  of  each  test. 

a.  Testing  Hq  fA  =  54.7  vs.  Ha  :  fA  <  54.7,  test  statistic  Z  =  —1.72. 

b.  Testing  Hq  :  fA  =  195  vs.  Ha  fA  ^  195,  test  statistic  Z  =  —2.07. 

c.  Testing  Hq  :  fA  =  —45  vs.  Ha  fA  >  —45,  test  statistic  z  =  2.54. 

2.  Compute  the  observed  significance  of  each  test. 

a.  Testing  Hq  \  [A  =  Ovs.  Ha  fl  ^  Q  test  statistic  z  =  2.82. 

b.  Testing  Hq  :  fA  =  18.4  vs.  Ha  /A  <  1 8.4,  test  statistic  Z  =  —1.74. 

c.  Testing  Hq  /A  =  63.85  vs.  Ha  /A  >  63.85,  test  statistic  z  =  1.93. 

3.  Compute  the  observed  significance  of  each  test.  (Some  of  the  information 
given  might  not  be  needed.) 

a.  Testing  Hq  fi  =■  27.5  vs .  Ha  :  fl  >  27.5;n  =  49,X  =  28.9,  s  =  3.14, 
test  statistic  z  =  3.12. 

b.  Testing  Hq  \  /A  =  581  vs.  Ha  fA  <  581;n  =  32,  X  =  560,  s  =  47.8, 
test  statistic  Z  =  —2.49. 

c.  Testing  Hq  fA  =  138.5  vs.  Ha  fA  ^  138.5;  n  =  44,  X  =  137.6,  s  = 
2.45,  test  statistic  Z  =  —2.44. 

4.  Compute  the  observed  significance  of  each  test.  (Some  of  the  information 
given  might  not  be  needed.) 

a.  Testing \(H_0: \mu = -17.9\) vs. \(H_a: \mu < -17.9\); \(n = 34\), \(\bar{x} = -18.2\), \(s = 0.87\),
test statistic \(Z = -2.01\).

b.  Testing \(H_0: \mu = 5.5\) vs. \(H_a: \mu \neq 5.5\); \(n = 56\), \(\bar{x} = 7.4\), \(s = 4.82\),
test statistic \(Z = 2.95\).

c.  Testing \(H_0: \mu = 1255\) vs. \(H_a: \mu > 1255\); \(n = 152\), \(\bar{x} = 1257\), \(s = 7.5\),
test statistic \(Z = 3.29\).

5.  Make  the  decision  in  each  test,  based  on  the  information  provided. 

a.  Testing \(H_0: \mu = 82.9\) vs. \(H_a: \mu < 82.9\) @ \(\alpha = 0.05\), observed
significance \(p = 0.038\).

b.  Testing \(H_0: \mu = 213.5\) vs. \(H_a: \mu \neq 213.5\) @ \(\alpha = 0.01\), observed
significance \(p = 0.038\).

6.  Make  the  decision  in  each  test,  based  on  the  information  provided. 
a.  Testing \(H_0: \mu = 31.4\) vs. \(H_a: \mu > 31.4\) @ \(\alpha = 0.10\), observed
significance \(p = 0.062\).

b.  Testing \(H_0: \mu = -75.5\) vs. \(H_a: \mu < -75.5\) @ \(\alpha = 0.05\),
observed significance \(p = 0.062\).


APPLICATIONS 


7.  A  lawyer  believes  that  a  certain  judge  imposes  prison  sentences  for  property 
crimes  that  are  longer  than  the  state  average  11.7  months.  He  randomly  selects 
36  of  the  judge’s  sentences  and  obtains  mean  13.8  and  standard  deviation  3.9 
months. 

a.  Perform  the  test  at  the  1%  level  of  significance  using  the  critical  value 
approach. 

b.  Compute  the  observed  significance  of  the  test. 

c.  Perform  the  test  at  the  1%  level  of  significance  using  the  p-value  approach. 
You  need  not  repeat  the  first  three  steps,  already  done  in  part  (a). 

8.  In  a  recent  year  the  fuel  economy  of  all  passenger  vehicles  was  19.8  mpg.  A 
trade  organization  sampled  50  passenger  vehicles  for  fuel  economy  and 
obtained  a  sample  mean  of  20.1  mpg  with  standard  deviation  2.45  mpg.  The 
sample  mean  20.1  exceeds  19.8,  but  perhaps  the  increase  is  only  a  result  of 
sampling  error. 

a.  Perform  the  relevant  test  of  hypotheses  at  the  20%  level  of  significance 
using  the  critical  value  approach. 

b.  Compute  the  observed  significance  of  the  test. 

c.  Perform  the  test  at  the  20%  level  of  significance  using  the  p-value 
approach.  You  need  not  repeat  the  first  three  steps,  already  done  in  part 
(a). 

9.  The  mean  score  on  a  25-point  placement  exam  in  mathematics  used  for  the 
past  two  years  at  a  large  state  university  is  14.3.  The  placement  coordinator 
wishes  to  test  whether  the  mean  score  on  a  revised  version  of  the  exam  differs 
from  14.3.  She  gives  the  revised  exam  to  30  entering  freshmen  early  in  the 
summer;  the  mean  score  is  14.6  with  standard  deviation  2.4. 

a.  Perform  the  test  at  the  10%  level  of  significance  using  the  critical  value 
approach. 

b.  Compute  the  observed  significance  of  the  test. 

c.  Perform  the  test  at  the  10%  level  of  significance  using  the  p-value 
approach.  You  need  not  repeat  the  first  three  steps,  already  done  in  part 
(a). 
10.  The mean increase in word family vocabulary among students in a one-year
foreign  language  course  is  576  word  families.  In  order  to  estimate  the  effect  of 
a  new  type  of  class  scheduling,  an  instructor  monitors  the  progress  of  60 
students;  the  sample  mean  increase  in  word  family  vocabulary  of  these 
students  is  542  word  families  with  sample  standard  deviation  18  word  families. 

a.  Test  at  the  5%  level  of  significance  whether  the  mean  increase  with  the 
new  class  scheduling  is  different  from  576  word  families,  using  the  critical 
value  approach. 

b.  Compute  the  observed  significance  of  the  test. 

c.  Perform  the  test  at  the  5%  level  of  significance  using  the  p-value  approach. 
You  need  not  repeat  the  first  three  steps,  already  done  in  part  (a). 

11.  The  mean  yield  for  hard  red  winter  wheat  in  a  certain  state  is  44.8  bu/acre.  In  a 
pilot  program  a  modified  growing  scheme  was  introduced  on  35  independent 
plots.  The  result  was  a  sample  mean  yield  of  45.4  bu/acre  with  sample  standard 
deviation  1.6  bu/acre,  an  apparent  increase  in  yield. 

a.  Test  at  the  5%  level  of  significance  whether  the  mean  yield  under  the  new 
scheme  is  greater  than  44.8  bu/acre,  using  the  critical  value  approach. 

b.  Compute  the  observed  significance  of  the  test. 

c.  Perform  the  test  at  the  5%  level  of  significance  using  the  p-value  approach. 
You  need  not  repeat  the  first  three  steps,  already  done  in  part  (a). 

12.  The  average  amount  of  time  that  visitors  spent  looking  at  a  retail  company’s 
old  home  page  on  the  world  wide  web  was  23.6  seconds.  The  company 
commissions  a  new  home  page.  On  its  first  day  in  place  the  mean  time  spent  at 
the  new  page  by  7,628  visitors  was  23.5  seconds  with  standard  deviation  5.1 
seconds. 

a.  Test  at  the  5%  level  of  significance  whether  the  mean  visit  time  for  the  new 
page  is  less  than  the  former  mean  of  23.6  seconds,  using  the  critical  value 
approach. 

b.  Compute  the  observed  significance  of  the  test. 

c.  Perform  the  test  at  the  5%  level  of  significance  using  the  p-value  approach. 
You  need  not  repeat  the  first  three  steps,  already  done  in  part  (a). 
ANSWERS


1.  a.  p-value = 0.0427

b.  p-value = 0.0384

c.  p-value = 0.0055

3.  a.  p-value = 0.0009

b.  p-value = 0.0064

c.  p-value = 0.0146

5.  a.  reject \(H_0\)

b.  do not reject \(H_0\)

7.  a.  \(Z = 3.23\), \(z_{0.01} = 2.33\), reject \(H_0\)

b.  p-value = 0.0006

c.  reject \(H_0\)

9.  a.  \(Z = 0.68\), \(z_{0.05} = 1.645\), do not reject \(H_0\)

b.  p-value = 0.4966

c.  do not reject \(H_0\)

11.  a.  \(Z = 2.22\), \(z_{0.05} = 1.645\), reject \(H_0\)

b.  p-value = 0.0132

c.  reject \(H_0\)
8.4  Small Sample Tests for a Population Mean


LEARNING  OBJECTIVE 

1.  To  learn  how  to  apply  the  five-step  test  procedure  for  test  of  hypotheses 
concerning  a  population  mean  when  the  sample  size  is  small. 


In the previous section hypothesis testing for population means was described in
the case of large samples. The statistical validity of the tests was ensured by the
Central Limit Theorem, with essentially no assumptions on the distribution of the
population. When sample sizes are small, as is often the case in practice, the Central
Limit Theorem does not apply. One must then impose stricter assumptions on the
population to give statistical validity to the test procedure. One common
assumption is that the population from which the sample is taken has a normal
probability distribution to begin with. Under such circumstances, if the population
standard deviation \(\sigma\) is known, then the test statistic \((\bar{x} - \mu_0) / (\sigma / \sqrt{n})\) still
has the standard normal distribution, as in the previous two sections. If \(\sigma\) is
unknown and is approximated by the sample standard deviation s, then the
resulting test statistic \((\bar{x} - \mu_0) / (s / \sqrt{n})\) follows Student’s t-distribution with
\(n - 1\) degrees of freedom.
Standardized Test Statistics for Small Sample Hypothesis
Tests Concerning a Single Population Mean


If \(\sigma\) is known: \[ Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]


If \(\sigma\) is unknown: \[ T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]


The first test statistic (\(\sigma\) known) has the standard normal distribution.


The second test statistic (\(\sigma\) unknown) has Student’s t-distribution with \(n - 1\)
degrees of freedom.

The population must be normally distributed.
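The choice between the two statistics can be expressed as a small helper. The sketch below is illustrative only: the function name `small_sample_statistic` and its keyword arguments are hypothetical, and it assumes the population is normally distributed, as the box requires.

```python
from math import sqrt


def small_sample_statistic(xbar, mu0, n, sigma=None, s=None):
    """Return (name, value, df) for the standardized test statistic.

    If the population standard deviation sigma is supplied, the statistic
    is Z (standard normal, so df is None). Otherwise the sample standard
    deviation s is used and the statistic is T with n - 1 degrees of freedom.
    """
    if sigma is not None:
        return ("Z", (xbar - mu0) / (sigma / sqrt(n)), None)
    if s is None:
        raise ValueError("supply sigma (if known) or s (sample standard deviation)")
    return ("T", (xbar - mu0) / (s / sqrt(n)), n - 1)


# Hypothetical numbers, chosen only to show both branches.
print(small_sample_statistic(10.0, 9.0, 4, sigma=2.0))
print(small_sample_statistic(169, 179, 5, s=10.39))
```

Returning the degrees of freedom together with the value keeps the information needed to look up the right row of the t table in one place.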


The distribution of the second standardized test statistic (the one containing s) and
the corresponding rejection region for each form of the alternative hypothesis
(left-tailed, right-tailed, or two-tailed), is shown in Figure 8.11 "Distribution of the
Standardized Test Statistic and the Rejection Region". This is just like Figure 8.4
"Distribution of the Standardized Test Statistic and the Rejection Region", except
that now the critical values are from the t-distribution. Figure 8.4 "Distribution of
the Standardized Test Statistic and the Rejection Region" still applies to the first
standardized test statistic (the one containing \(\sigma\)) since it follows the standard
normal distribution.
Figure 8.11  Distribution of the Standardized Test Statistic and the Rejection Region

[Figure: three panels, each marking the region "Reject \(H_0\)", for \(H_a: \mu < \mu_0\), \(H_a: \mu > \mu_0\), and \(H_a: \mu \neq \mu_0\).]


The  p-value  of  a  test  of  hypotheses  for  which  the  test  statistic  has  Student’s  t- 
distribution  can  be  computed  using  statistical  software,  but  it  is  impractical  to  do  so 
using  tables,  since  that  would  require  30  tables  analogous  to  Figure  12.2 
"Cumulative  Normal  Probability",  one  for  each  degree  of  freedom  from  1  to  30. 
Figure  12.3  "Critical  Values  of"  can  be  used  to  approximate  the  p-value  of  such  a 
test,  and  this  is  typically  adequate  for  making  a  decision  using  the  p-value  approach 
to  hypothesis  testing,  although  not  always.  For  this  reason  the  tests  in  the  two 
examples  in  this  section  will  be  made  following  the  critical  value  approach  to 
hypothesis  testing  summarized  at  the  end  of  Section  8.1  "The  Elements  of 
Hypothesis  Testing",  but  after  each  one  we  will  show  how  the  p-value  approach 
could  have  been  used. 
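To illustrate what "computed using statistical software" can mean, here is a minimal sketch that finds tail areas of Student’s t-distribution by numerical integration of its density. It uses only the Python standard library; the names `t_pdf`, `t_right_tail`, and `t_p_value` are hypothetical, Simpson's rule is one of several possible integration methods, and a dedicated statistics package would normally be used instead.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_right_tail(t, df, steps=2000):
    """Area to the right of t >= 0, via Simpson's rule on [0, t].

    steps must be even. math.gamma overflows for very large df, so this
    sketch is meant for the small df values that appear in the table.
    """
    if t < 0:
        raise ValueError("pass the unsigned value of the test statistic")
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # By symmetry the area to the right of 0 is exactly 0.5.
    return 0.5 - total * h / 3


def t_p_value(t, df, tail):
    """p-value for tail = 'left', 'right', or 'two'."""
    right = t_right_tail(abs(t), df)
    if tail == "two":
        return 2 * right
    if tail == "left":
        return right if t <= 0 else 1 - right
    if tail == "right":
        return right if t >= 0 else 1 - right
    raise ValueError("tail must be 'left', 'right', or 'two'")


print(f"{t_p_value(-2.152, 4, 'left'):.4f}")
```

A quick sanity check is to feed the function a critical value from the t table: the area to the right of \(t_{0.05}\) for a given df should come out very close to 0.05.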
EXAMPLE 10


The price of a popular tennis racket at a national chain store is $179. Portia
bought  five  of  the  same  racket  at  an  online  auction  site  for  the  following 
prices: 


155  179  175  175  161 


Assuming  that  the  auction  prices  of  rackets  are  normally  distributed, 
determine  whether  there  is  sufficient  evidence  in  the  sample,  at  the  5%  level 
of  significance,  to  conclude  that  the  average  price  of  the  racket  is  less  than 
$179  if  purchased  at  an  online  auction. 

Solution: 

•  Step 1. The assertion for which evidence must be provided is that
the average online price \(\mu\) is less than the average price in retail
stores, so the hypothesis test is


\(H_0: \mu = 179\)

vs. \(H_a: \mu < 179\) @ \(\alpha = 0.05\)

•  Step 2. The sample is small and the population standard
deviation is unknown. Thus the test statistic is


\[ T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]


and has the Student t-distribution with \(n - 1 = 5 - 1 = 4\)
degrees of freedom.

•  Step 3. From the data we compute \(\bar{x} = 169\) and \(s = 10.39\).
Inserting these values into the formula for the test statistic gives


\[ T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{169 - 179}{10.39 / \sqrt{5}} = -2.152 \]
•  Step 4. Since the symbol in \(H_a\) is “<” this is a left-tailed test, so there is a
single critical value, \(-t_\alpha = -t_{0.05}\) [df = 4]. Reading from the row
labeled df = 4 in Figure 12.3 "Critical Values of" its value is -2.132.
The rejection region is \((-\infty, -2.132]\).

•  Step 5. As shown in Figure 8.12 "Rejection Region and Test
Statistic for" the test statistic falls in the rejection region. The
decision is to reject \(H_0\). In the context of the problem our
conclusion is:

The  data  provide  sufficient  evidence,  at  the  5%  level  of 
significance,  to  conclude  that  the  average  price  of  such  rackets 
purchased  at  online  auctions  is  less  than  $179. 


Figure 8.12  Rejection Region and Test Statistic for Note 8.42 "Example 10"

[Figure: left-tailed rejection region for \(H_a: \mu < 179\).]


To perform the test in Note 8.42 "Example 10" using the p-value approach, look in
the row in Figure 12.3 "Critical Values of" with the heading df = 4 and search for
the two t-values that bracket the unsigned value 2.152 of the test statistic. They are
2.132 and 2.776, in the columns with headings \(t_{0.050}\) and \(t_{0.025}\). They cut off right
tails of area 0.050 and 0.025, so because 2.152 is between them it must cut off a tail
of area between 0.050 and 0.025. By symmetry -2.152 cuts off a left tail of area
between 0.050 and 0.025, hence the p-value corresponding to \(t = -2.152\) is
between 0.025 and 0.05. Although its precise value is unknown, it must be less than
\(\alpha = 0.05\), so the decision is to reject \(H_0\).
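The arithmetic of Steps 3 through 5 in Example 10 can be checked with a short script. This is a sketch using only the Python standard library; the variable names are illustrative, and the critical value is simply typed in from the t table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

prices = [155, 179, 175, 175, 161]  # auction prices from the example
mu0 = 179                           # hypothesized mean (retail price)
critical_value = -2.132             # -t_0.05 for df = 4, read from the table

n = len(prices)
xbar = mean(prices)
s = stdev(prices)                   # sample standard deviation (divides by n - 1)
T = (xbar - mu0) / (s / sqrt(n))

# Left-tailed test: reject H0 when T falls in (-infinity, -2.132].
reject = T <= critical_value
print(f"xbar = {xbar}, s = {s:.2f}, T = {T:.3f}, reject H0: {reject}")
```

Note that `statistics.stdev` computes the sample standard deviation, which is the quantity s used in the test statistic; `statistics.pstdev` would divide by n instead and give a different value.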
EXAMPLE 11


A  small  component  in  an  electronic  device  has  two  small  holes  where 
another  tiny  part  is  fitted.  In  the  manufacturing  process  the  average 
distance  between  the  two  holes  must  be  tightly  controlled  at  0.02  mm,  else 
many  units  would  be  defective  and  wasted.  Many  times  throughout  the  day 
quality  control  engineers  take  a  small  sample  of  the  components  from  the 
production  line,  measure  the  distance  between  the  two  holes,  and  make 
adjustments  if  needed.  Suppose  at  one  time  four  units  are  taken  and  the 
distances  are  measured  as 

0.021  0.019  0.023  0.020 

Determine,  at  the  1%  level  of  significance,  if  there  is  sufficient  evidence  in 
the  sample  to  conclude  that  an  adjustment  is  needed.  Assume  the  distances 
of  interest  are  normally  distributed. 

Solution: 


•  Step 1. The assumption is that the process is under control unless
there is strong evidence to the contrary. Since a deviation of the
average distance to either side is undesirable, the relevant test is

\(H_0: \mu = 0.02\)

vs. \(H_a: \mu \neq 0.02\) @ \(\alpha = 0.01\)

where \(\mu\) denotes the mean distance between the holes.

•  Step 2. The sample is small and the population standard
deviation is unknown. Thus the test statistic is

\[ T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

and has the Student t-distribution with \(n - 1 = 4 - 1 = 3\)
degrees of freedom.
•  Step 3. From the data we compute \(\bar{x} = 0.02075\) and \(s = 0.00171\).
Inserting these values into the formula for the test
statistic gives


\[ T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{0.02075 - 0.02}{0.00171 / \sqrt{4}} = 0.877 \]


•  Step 4. Since the symbol in \(H_a\) is “≠” this is a two-tailed test, so there are
two critical values, \(\pm t_{\alpha/2} = \pm t_{0.005}\) [df = 3]. Reading from the
row in Figure 12.3 "Critical Values of" labeled df = 3 their values are
±5.841. The rejection region is \((-\infty, -5.841] \cup [5.841, \infty)\).

•  Step 5. As shown in Figure 8.13 "Rejection Region and Test
Statistic for" the test statistic does not fall in the rejection
region. The decision is not to reject \(H_0\). In the context of the
problem our conclusion is:


The  data  do  not  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  the  mean  distance  between  the 
holes  in  the  component  differs  from  0.02  mm. 


Figure 8.13  Rejection Region and Test Statistic for Note 8.43 "Example 11"

[Figure: two-tailed rejection region for \(H_a: \mu \neq 0.02\).]


To perform the test in Note 8.43 "Example 11" using the p-value approach, look in
the row in Figure 12.3 "Critical Values of" with the heading df = 3 and search for
the two t-values that bracket the value 0.877 of the test statistic. Actually 0.877 is
smaller than the smallest number in the row, which is 0.978, in the column with
heading \(t_{0.200}\). The value 0.978 cuts off a right tail of area 0.200, so because 0.877 is
to its left it must cut off a tail of area greater than 0.200. Thus the p-value, which is
double the area cut off (since the test is two-tailed), is greater than 0.400.
Although its precise value is unknown, it must be greater than \(\alpha = 0.01\), so the
decision is not to reject \(H_0\).
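The critical value approach used in both examples follows the same pattern, so it can be wrapped in one function. The sketch below is an illustration, not a standard library routine: the name `t_test_by_critical_value` is hypothetical, and the positive critical value must still be looked up in the t table and passed in by hand.

```python
from math import sqrt
from statistics import mean, stdev


def t_test_by_critical_value(data, mu0, t_crit, tail):
    """Small-sample t test of H0: mu = mu0 by the critical value approach.

    t_crit is the positive table value (t_alpha for a one-tailed test,
    t_{alpha/2} for a two-tailed test). tail is 'left', 'right', or 'two'.
    Returns (T, reject), where reject is True when T lies in the
    rejection region.
    """
    n = len(data)
    T = (mean(data) - mu0) / (stdev(data) / sqrt(n))
    if tail == "left":
        reject = T <= -t_crit
    elif tail == "right":
        reject = T >= t_crit
    elif tail == "two":
        reject = abs(T) >= t_crit
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return T, reject


# Example 11: two-tailed test at the 1% level, df = 3, t_0.005 = 5.841.
distances = [0.021, 0.019, 0.023, 0.020]
T, reject = t_test_by_critical_value(distances, 0.02, 5.841, "two")
print(f"T = {T:.3f}, reject H0: {reject}")
```

Because the script carries the unrounded sample standard deviation through the computation, its value of T can differ slightly in the third decimal place from the 0.877 obtained in the text with s rounded to 0.00171. The decision is unaffected.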


KEY  TAKEAWAYS 


•  There  are  two  formulas  for  the  test  statistic  in  testing  hypotheses  about 
a  population  mean  with  small  samples.  One  test  statistic  follows  the 
standard  normal  distribution,  the  other  Student’s  t-distribution. 

•  The  population  standard  deviation  is  used  if  it  is  known,  otherwise  the 
sample  standard  deviation  is  used. 

•  Either  five-step  procedure,  critical  value  or  p-value  approach,  is  used 
with  either  test  statistic. 
1.  Find the rejection region (for the standardized test statistic) for each
hypothesis  test  based  on  the  information  given.  The  population  is  normally 
distributed. 

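a.  \(H_0: \mu = 27\) vs. \(H_a: \mu < 27\) @ \(\alpha = 0.05\), \(n = 12\), \(\sigma = 2.2\).

b.  \(H_0: \mu = 52\) vs. \(H_a: \mu \neq 52\) @ \(\alpha = 0.05\), \(n = 6\), \(\sigma\) unknown.

c.  \(H_0: \mu = -105\) vs. \(H_a: \mu > -105\) @ \(\alpha = 0.10\), \(n = 24\), \(\sigma\) unknown.

d.  \(H_0: \mu = 78.8\) vs. \(H_a: \mu \neq 78.8\) @ \(\alpha = 0.10\), \(n = 8\), \(\sigma = 1.7\).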

2.  Find  the  rejection  region  (for  the  standardized  test  statistic)  for  each 
hypothesis  test  based  on  the  information  given.  The  population  is  normally 
distributed. 

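a.  \(H_0: \mu = 11\) vs. \(H_a: \mu < 11\) @ \(\alpha = 0.01\), \(n = 26\), \(\sigma = 0.94\).

b.  \(H_0: \mu = 880\) vs. \(H_a: \mu \neq 880\) @ \(\alpha = 0.01\), \(n = 4\), \(\sigma\) unknown.

c.  \(H_0: \mu = -12\) vs. \(H_a: \mu > -12\) @ \(\alpha = 0.05\), \(n = 18\), \(\sigma = 1.1\).

d.  \(H_0: \mu = 21.1\) vs. \(H_a: \mu \neq 21.1\) @ \(\alpha = 0.05\), \(n = 23\), \(\sigma\) unknown.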

3.  Find  the  rejection  region  (for  the  standardized  test  statistic)  for  each 
hypothesis  test  based  on  the  information  given.  The  population  is  normally 
distributed.  Identify  the  test  as  left-tailed,  right-tailed,  or  two-tailed. 

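a.  \(H_0: \mu = 141\) vs. \(H_a: \mu < 141\) @ \(\alpha = 0.20\), \(n = 29\), \(\sigma\) unknown.

b.  \(H_0: \mu = -54\) vs. \(H_a: \mu < -54\) @ \(\alpha = 0.05\), \(n = 15\), \(\sigma = 1.9\).

c.  \(H_0: \mu = 98.6\) vs. \(H_a: \mu \neq 98.6\) @ \(\alpha = 0.05\), \(n = 12\), \(\sigma\) unknown.

d.  \(H_0: \mu = 3.8\) vs. \(H_a: \mu > 3.8\) @ \(\alpha = 0.001\), \(n = 27\), \(\sigma\) unknown.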

4.  Find  the  rejection  region  (for  the  standardized  test  statistic)  for  each 
hypothesis  test  based  on  the  information  given.  The  population  is  normally 
distributed.  Identify  the  test  as  left-tailed,  right-tailed,  or  two-tailed. 

a.  \(H_0: \mu = -62\) vs. \(H_a: \mu \neq -62\) @ \(\alpha = 0.005\), \(n = 8\), \(\sigma\) unknown.

b.  \(H_0: \mu = 13\) vs. \(H_a: \mu > 13\) @ \(\alpha = 0.001\), \(n = 22\), \(\sigma\) unknown.

c.  \(H_0: \mu = 1124\) vs. \(H_a: \mu < 1124\) @ \(\alpha = 0.001\), \(n = 21\), \(\sigma\) unknown.

d.  \(H_0: \mu = 0.12\) vs. \(H_a: \mu \neq 0.12\) @ \(\alpha = 0.001\), \(n = 14\), \(\sigma = 0.026\).

5.  A  random  sample  of  size  20  drawn  from  a  normal  population  yielded  the 
following results: \(\bar{x} = 49.2\), \(s = 1.33\).

a.  Test \(H_0: \mu = 50\) vs. \(H_a: \mu \neq 50\) @ \(\alpha = 0.01\).

b.  Estimate  the  observed  significance  of  the  test  in  part  (a)  and  state  a 
decision  based  on  the  p-value  approach  to  hypothesis  testing. 

6.  A  random  sample  of  size  16  drawn  from  a  normal  population  yielded  the 
following results: \(\bar{x} = -0.96\), \(s = 1.07\).

a.  Test \(H_0: \mu = 0\) vs. \(H_a: \mu < 0\) @ \(\alpha = 0.001\).

b.  Estimate  the  observed  significance  of  the  test  in  part  (a)  and  state  a 
decision  based  on  the  p-value  approach  to  hypothesis  testing. 

7.  A  random  sample  of  size  8  drawn  from  a  normal  population  yielded  the 
following results: \(\bar{x} = 289\), \(s = 46\).

a.  Test \(H_0: \mu = 250\) vs. \(H_a: \mu > 250\) @ \(\alpha = 0.05\).

b.  Estimate  the  observed  significance  of  the  test  in  part  (a)  and  state  a 
decision  based  on  the  p-value  approach  to  hypothesis  testing. 

8.  A  random  sample  of  size  12  drawn  from  a  normal  population  yielded  the 
following results: \(\bar{x} = 86.2\), \(s = 0.63\).

a.  Test \(H_0: \mu = 85.5\) vs. \(H_a: \mu \neq 85.5\) @ \(\alpha = 0.01\).

b.  Estimate  the  observed  significance  of  the  test  in  part  (a)  and  state  a 
decision  based  on  the  p-value  approach  to  hypothesis  testing. 


APPLICATIONS 


9.  Researchers  wish  to  test  the  efficacy  of  a  program  intended  to  reduce  the 
length  of  labor  in  childbirth.  The  accepted  mean  labor  time  in  the  birth  of  a 
first  child  is  15.3  hours.  The  mean  length  of  the  labors  of  13  first-time  mothers 
in  a  pilot  program  was  8.8  hours  with  standard  deviation  3.1  hours.  Assuming  a 
normal distribution of times of labor, test at the 10% level of significance
whether the mean labor time for all women following this program is less than
15.3  hours. 

10.  A  dairy  farm  uses  the  somatic  cell  count  (SCC)  report  on  the  milk  it  provides  to 
a  processor  as  one  way  to  monitor  the  health  of  its  herd.  The  mean  SCC  from 
five  samples  of  raw  milk  was  250,000  cells  per  milliliter  with  standard 
deviation  37,500  cell/ml.  Test  whether  these  data  provide  sufficient  evidence, 
at  the  10%  level  of  significance,  to  conclude  that  the  mean  SCC  of  all  milk 
produced  at  the  dairy  exceeds  that  in  the  previous  report,  210,250  cell/ml. 
Assume  a  normal  distribution  of  SCC. 

11.  Six coins of the same type are discovered at an archaeological site. If their
weights on average are significantly different from 5.25 grams then it can be
assumed that their provenance is not the site itself. The coins are weighed and
have mean 4.73 g with sample standard deviation 0.18 g. Perform the relevant
test at the 0.1% (1/10th of 1%) level of significance, assuming a normal
distribution  of  weights  of  all  such  coins. 

12.  An  economist  wishes  to  determine  whether  people  are  driving  less  than  in  the 
past.  In  one  region  of  the  country  the  number  of  miles  driven  per  household 
per  year  in  the  past  was  18.59  thousand  miles.  A  sample  of  15  households 
produced  a  sample  mean  of  16.23  thousand  miles  for  the  last  year,  with  sample 
standard  deviation  4.06  thousand  miles.  Assuming  a  normal  distribution  of 
household  driving  distances  per  year,  perform  the  relevant  test  at  the  5%  level 
of  significance. 

13.  The recommended daily allowance of iron for females aged 19-50 is 18 mg/day.
A careful measurement of the daily iron intake of 15 women yielded a mean
daily intake of 16.2 mg with sample standard deviation 4.7 mg.

a.  Assuming that daily iron intake in women is normally distributed, perform
the test that the actual mean daily intake for all women is different from
18 mg/day, at the 10% level of significance.

b.  The sample mean is less than 18, suggesting that the actual population
mean is less than 18 mg/day. Perform this test, also at the 10% level of
significance. (The computation of the test statistic done in part (a) still
applies here.)

14.  The  target  temperature  for  a  hot  beverage  the  moment  it  is  dispensed  from  a 
vending  machine  is  170°F.  A  sample  of  ten  randomly  selected  servings  from  a 
new  machine  undergoing  a  pre-shipment  inspection  gave  mean  temperature 
173°F  with  sample  standard  deviation  6.3°F. 

a.  Assuming  that  temperature  is  normally  distributed,  perform  the  test  that 
the  mean  temperature  of  dispensed  beverages  is  different  from  170°F,  at 
the  10%  level  of  significance. 

b.  The  sample  mean  is  greater  than  170,  suggesting  that  the  actual  population 
mean  is  greater  than  170°F.  Perform  this  test,  also  at  the  10%  level  of 
significance.  (The  computation  of  the  test  statistic  done  in  part  (a)  still 
applies  here.) 

15.  The  average  number  of  days  to  complete  recovery  from  a  particular  type  of 
knee  operation  is  123.7  days.  From  his  experience  a  physician  suspects  that  use 
of  a  topical  pain  medication  might  be  lengthening  the  recovery  time.  He 
randomly  selects  the  records  of  seven  knee  surgery  patients  who  used  the 
topical  medication.  The  times  to  total  recovery  were: 

128  135  121  142  126  151  123 
a.  Assuming a normal distribution of recovery times, perform the relevant
test of hypotheses at the 10% level of significance.

b.  Would the decision be the same at the 5% level of significance? Answer
either by constructing a new rejection region (critical value approach) or
by estimating the p-value of the test in part (a) and comparing it to \(\alpha\).

16.  A  24-hour  advance  prediction  of  a  day’s  high  temperature  is  “unbiased”  if  the 
long-term  average  of  the  error  in  prediction  (true  high  temperature  minus 
predicted  high  temperature)  is  zero.  The  errors  in  predictions  made  by  one 
meteorological  station  for  20  randomly  selected  days  were: 

 2   0  -3   1  -2

 1   0  -1   1  -1

-4   1   1  -4   0

-4  -3  -4   2   2

a.  Assuming  a  normal  distribution  of  errors,  test  the  null  hypothesis  that  the 
predictions  are  unbiased  (the  mean  of  the  population  of  all  errors  is  0) 
versus  the  alternative  that  it  is  biased  (the  population  mean  is  not  0),  at 
the  1%  level  of  significance. 

b.  Would  the  decision  be  the  same  at  the  5%  level  of  significance?  The  10% 
level  of  significance?  Answer  either  by  constructing  new  rejection  regions 
(critical value approach) or by estimating the p-value of the test in part (a)
and comparing it to \(\alpha\).

17.  Pasteurized milk may not have a standardized plate count (SPC) above 20,000
colony-forming bacteria per milliliter (cfu/ml). The mean SPC for five samples
was  21,500  cfu/ml  with  sample  standard  deviation  750  cfu/ml.  Test  the  null 
hypothesis  that  the  mean  SPC  for  this  milk  is  20,000  versus  the  alternative  that 
it  is  greater  than  20,000,  at  the  10%  level  of  significance.  Assume  that  the  SPC 
follows  a  normal  distribution. 

18.  One  water  quality  standard  for  water  that  is  discharged  into  a  particular  type 
of  stream  or  pond  is  that  the  average  daily  water  temperature  be  at  most  18°C. 
Six  samples  taken  throughout  the  day  gave  the  data: 

16.8  21.5  19.1  12.8  18.0  20.7 

The sample mean \(\bar{x} = 18.15\) exceeds 18, but perhaps this is only sampling
error.  Determine  whether  the  data  provide  sufficient  evidence,  at  the  10%  level 
of  significance,  to  conclude  that  the  mean  temperature  for  the  entire  day 
exceeds  18°C. 
ADDITIONAL EXERCISES


19.  A  calculator  has  a  built-in  algorithm  for  generating  a  random  number 
according  to  the  standard  normal  distribution.  Twenty-five  numbers  thus 
generated  have  mean  0.15  and  sample  standard  deviation  0.94.  Test  the  null 
hypothesis  that  the  mean  of  all  numbers  so  generated  is  0  versus  the 
alternative  that  it  is  different  from  0,  at  the  20%  level  of  significance.  Assume 
that  the  numbers  do  follow  a  normal  distribution. 

20.  At  every  setting  a  high-speed  packing  machine  delivers  a  product  in  amounts 
that  vary  from  container  to  container  with  a  normal  distribution  of  standard 
deviation 0.12 ounce. To compare the amount delivered at the current setting
to the desired amount 64.1 ounces, a quality inspector randomly selects five
containers  and  measures  the  contents  of  each,  obtaining  sample  mean  63.9 
ounces  and  sample  standard  deviation  0.10  ounce.  Test  whether  the  data 
provide  sufficient  evidence,  at  the  5%  level  of  significance,  to  conclude  that  the 
mean  of  all  containers  at  the  current  setting  is  less  than  64.1  ounces. 

21.  A  manufacturing  company  receives  a  shipment  of  1,000  bolts  of  nominal  shear 
strength  4,350  lb.  A  quality  control  inspector  selects  five  bolts  at  random  and 
measures  the  shear  strength  of  each.  The  data  are: 

4,320  4,290  4,360  4,350  4,320 

a.  Assuming  a  normal  distribution  of  shear  strengths,  test  the  null  hypothesis 
that  the  mean  shear  strength  of  all  bolts  in  the  shipment  is  4,350  lb  versus 
the  alternative  that  it  is  less  than  4,350  lb,  at  the  10%  level  of  significance. 

b.  Estimate  the  p-value  (observed  significance)  of  the  test  of  part  (a). 

c.  Compare  the  p-value  found  in  part  (b)  to  a  =  0.10  and  make  a  decision 
based  on  the  p-value  approach.  Explain  fully. 

22.  A  literary  historian  examines  a  newly  discovered  document  possibly  written  by 
Oberon  Theseus.  The  mean  average  sentence  length  of  the  surviving 
undisputed  works  of  Oberon  Theseus  is  48.72  words.  The  historian  counts 
words  in  sentences  between  five  successive  101  periods  in  the  document  in 
question  to  obtain  a  mean  average  sentence  length  of  39.46  words  with 
standard  deviation  7.45  words.  (Thus  the  sample  size  is  five.) 

a.  Determine  if  these  data  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  the  mean  average  sentence  length  in  the 
document  is  less  than  48.72. 

b.  Estimate  the  p-value  of  the  test. 

c.  Based  on  the  answers  to  parts  (a)  and  (b),  state  whether  or  not  it  is  likely 
that  the  document  was  written  by  Oberon  Theseus. 
ANSWERS


1.  a.  \(Z \le -1.645\)

b.  \(T \le -2.571\) or \(T \ge 2.571\)

c.  \(T \ge 1.319\)

d.  \(Z \le -1.645\) or \(Z \ge 1.645\)

3.  a.  \(T \le -0.855\)

b.  \(Z \le -1.645\)

c.  \(T \le -2.201\) or \(T \ge 2.201\)

d.  \(T \ge 3.435\)

5.  a.  \(T = -2.690\), df = 19, \(-t_{0.005} = -2.861\), do not reject \(H_0\).
b.  0.01 < p-value < 0.02, \(\alpha = 0.01\), do not reject \(H_0\).

7.  a.  \(T = 2.398\), df = 7, \(t_{0.05} = 1.895\), reject \(H_0\).

b.  0.01 < p-value < 0.025, \(\alpha = 0.05\), reject \(H_0\).

9.  \(T = -7.560\), df = 12, \(-t_{0.10} = -1.356\), reject \(H_0\).

11.  \(T = -7.076\), df = 5, \(-t_{0.0005} = -6.869\), reject \(H_0\).

13.  a.  \(T = -1.483\), df = 14, \(-t_{0.05} = -1.761\), do not reject \(H_0\);
b.  \(T = -1.483\), df = 14, \(-t_{0.10} = -1.345\), reject \(H_0\);

15.  a.  \(T = 2.069\), df = 6, \(t_{0.10} = 1.44\), reject \(H_0\);

b.  \(T = 2.069\), df = 6, \(t_{0.05} = 1.943\), reject \(H_0\).

17.  \(T = 4.472\), df = 4, \(t_{0.10} = 1.533\), reject \(H_0\).

19.  \(T = 0.798\), df = 24, \(t_{0.10} = 1.318\), do not reject \(H_0\).

21.  a.  \(T = -1.773\), df = 4, \(-t_{0.05} = -2.132\), do not reject \(H_0\).

b.  0.05 < p-value < 0.10

c.  \(\alpha = 0.05\), do not reject \(H_0\)
8.5  Large Sample Tests for a Population Proportion


LEARNING  OBJECTIVES 

1. To learn how to apply the five-step critical value test procedure for tests
of hypotheses concerning a population proportion.

2. To learn how to apply the five-step p-value test procedure for tests of
hypotheses concerning a population proportion.


Both the critical value approach and the p-value approach can be applied to test
hypotheses about a population proportion $p$. The null hypothesis will have the form
$H_0 : p = p_0$ for some specific number $p_0$ between 0 and 1. The alternative
hypothesis will be one of the three inequalities $p < p_0$, $p > p_0$, or $p \neq p_0$ for the
same number $p_0$ that appears in the null hypothesis.


The information in Section 6.3 "The Sample Proportion" in Chapter 6 "Sampling
Distributions" gives the following formula for the test statistic and its distribution.
In the formula $p_0$ is the numerical value of $p$ that appears in the two hypotheses,
$q_0 = 1 - p_0$, $\hat{p}$ is the sample proportion, and $n$ is the sample size. Remember that
the condition that the sample be large is not that $n$ be at least 30 but that the
interval

\[ \left[ \hat{p} - 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right] \]

lie wholly within the interval $[0, 1]$.
Standardized Test Statistic for Large Sample Hypothesis
Tests Concerning a Single Population Proportion

\[ Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} \]

The test statistic has the standard normal distribution.
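The large-sample condition and the standardized test statistic can be sketched in a few lines of Python. This is only an illustration, not part of the text's procedure: the function names `is_large_sample` and `proportion_z` are hypothetical, chosen here for readability.

```python
from math import sqrt


def is_large_sample(p_hat, n):
    """Return True when [p_hat - 3*se, p_hat + 3*se] lies within [0, 1]."""
    se = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard deviation of p_hat
    return 0 <= p_hat - 3 * se and p_hat + 3 * se <= 1


def proportion_z(p_hat, p0, n):
    """Standardized test statistic Z = (p_hat - p0) / sqrt(p0 * q0 / n)."""
    q0 = 1 - p0
    return (p_hat - p0) / sqrt(p0 * q0 / n)


# Illustrative values: sample proportion 0.56 from n = 360, tested against p0 = 0.50.
print(is_large_sample(0.56, 360))
print(round(proportion_z(0.56, 0.50, 360), 3))
```

Note that the sample-size check uses the sample proportion, while the test statistic uses the hypothesized values $p_0$ and $q_0$ in its denominator.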


The distribution of the standardized test statistic and the corresponding rejection
region for each form of the alternative hypothesis (left-tailed, right-tailed, or
two-tailed) are shown in Figure 8.14 "Distribution of the Standardized Test Statistic and
the Rejection Region".


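Figure 8.14 Distribution of the Standardized Test Statistic and the Rejection Region

[Figure: three standard normal curves, one for each form of the alternative hypothesis, $H_a : p < p_0$, $H_a : p > p_0$, and $H_a : p \neq p_0$, each with its rejection region marked "Reject $H_0$".]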
EXAMPLE 12


A soft drink maker claims that a majority of adults prefer its leading
beverage over that of its main competitor. To test this claim 500 randomly
selected  people  were  given  the  two  beverages  in  random  order  to  taste. 
Among  them,  270  preferred  the  soft  drink  maker’s  brand,  211  preferred  the 
competitor’s  brand,  and  19  could  not  make  up  their  minds.  Determine 
whether  there  is  sufficient  evidence,  at  the  5%  level  of  significance,  to 
support  the  soft  drink  maker’s  claim  against  the  default  that  the  population 
is  evenly  split  in  its  preference. 

Solution: 


We  will  use  the  critical  value  approach  to  perform  the  test.  The  same  test 
will  be  performed  using  the  p-value  approach  in  Note  8.49  "Example  14". 


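We must check that the sample is sufficiently large to validly perform the
test. Since $\hat{p} = 270/500 = 0.54$,

\[ \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{(0.54)(0.46)}{500}} \approx 0.02 \]

hence

\[ \left[ \hat{p} - 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right] = [0.54 - (3)(0.02),\ 0.54 + (3)(0.02)] = [0.48, 0.60] \subset [0, 1] \]

so the sample is sufficiently large.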


• Step 1. The relevant test is

\[ H_0 : p = 0.50 \quad \text{vs.} \quad H_a : p > 0.50 \quad @\ \alpha = 0.05 \]

where $p$ denotes the proportion of all adults who prefer the
company's beverage over that of its competitor.

• Step 2. The test statistic is

\[ Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} \]

and has the standard normal distribution.

• Step 3. The value of the test statistic is

\[ Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \frac{0.54 - 0.50}{\sqrt{\dfrac{(0.50)(0.50)}{500}}} = 1.789 \]

• Step 4. Since the symbol in $H_a$ is ">" this is a right-tailed test, so there is
a single critical value, $z_\alpha = z_{0.05}$. Reading from the last line in Figure
12.3 "Critical Values of" its value is 1.645. The rejection region is
$[1.645, \infty)$.

• Step 5. As shown in Figure 8.15 "Rejection Region and Test
Statistic for Note 8.47 "Example 12"" the test statistic falls in the
rejection region. The decision is to reject $H_0$. In the context of the
problem our conclusion is:

The data provide sufficient evidence, at the 5% level of
significance, to conclude that a majority of adults prefer the
company's beverage to that of its competitor.
Figure 8.15 Rejection Region and Test Statistic for Note 8.47 "Example 12"

[Figure: standard normal curve for $H_a : p > 0.5$.]
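The critical value approach used in this example can be checked numerically. The sketch below is an illustration under stated assumptions: the helper `critical_value_test` and its `tail` argument are hypothetical names, and the critical values come from Python's `statistics.NormalDist` rather than from the printed table, so they carry more decimal places than the table does.

```python
from math import sqrt
from statistics import NormalDist


def critical_value_test(p_hat, p0, n, alpha, tail):
    """Return (z, reject) for a large-sample test about one proportion.

    tail is "left", "right", or "two", matching the symbol in H_a.
    """
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "right":
        reject = z >= std_normal.inv_cdf(1 - alpha)
    elif tail == "left":
        reject = z <= std_normal.inv_cdf(alpha)
    elif tail == "two":
        reject = abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, reject


# Data of the soft drink example: 270 of 500 prefer the brand, alpha = 0.05.
z, reject = critical_value_test(270 / 500, 0.50, 500, 0.05, "right")
print(round(z, 3), reject)
```

For the soft drink data the computed statistic rounds to 1.789 and falls in the rejection region, in agreement with the worked solution.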
EXAMPLE 13


Globally the long-term proportion of newborns who are male is 51.46%. A
researcher  believes  that  the  proportion  of  boys  at  birth  changes  under 
severe  economic  conditions.  To  test  this  belief  randomly  selected  birth 
records  of  5,000  babies  born  during  a  period  of  economic  recession  were 
examined.  It  was  found  in  the  sample  that  52.55%  of  the  newborns  were 
boys.  Determine  whether  there  is  sufficient  evidence,  at  the  10%  level  of 
significance,  to  support  the  researcher’s  belief. 


Solution: 


We  will  use  the  critical  value  approach  to  perform  the  test.  The  same  test 
will  be  performed  using  the  p-value  approach  in  Note  8.50  "Example  15". 

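The sample is sufficiently large to validly perform the test since

\[ \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{(0.5255)(0.4745)}{5000}} \approx 0.01 \]

hence

\[ \left[ \hat{p} - 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + 3\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right] = [0.5255 - 0.03,\ 0.5255 + 0.03] = [0.4955, 0.5555] \subset [0, 1] \]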


• Step 1. Let $p$ be the true proportion of boys among all newborns
during the recession period. The burden of proof is to show that
severe economic conditions change it from the historic long-term
value of 0.5146 rather than to show that it stays the same,
so the hypothesis test is

\[ H_0 : p = 0.5146 \quad \text{vs.} \quad H_a : p \neq 0.5146 \quad @\ \alpha = 0.10 \]
• Step 2. The test statistic is

\[ Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} \]

and has the standard normal distribution.

• Step 3. The value of the test statistic is

\[ Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \frac{0.5255 - 0.5146}{\sqrt{\dfrac{(0.5146)(0.4854)}{5000}}} = 1.542 \]


• Step 4. Since the symbol in $H_a$ is "$\neq$" this is a two-tailed test, so there is
a pair of critical values, $\pm z_{\alpha/2} = \pm z_{0.05} = \pm 1.645$. The rejection
region is $(-\infty, -1.645] \cup [1.645, \infty)$.

• Step 5. As shown in Figure 8.16 "Rejection Region and Test
Statistic for Note 8.48 "Example 13"" the test statistic does not fall in the
rejection region. The decision is not to reject $H_0$. In the context of the
problem our conclusion is:


The  data  do  not  provide  sufficient  evidence,  at  the  10%  level  of 
significance,  to  conclude  that  the  proportion  of  newborns  who 
are  male  differs  from  the  historic  proportion  in  times  of 
economic  recession. 
Figure 8.16 Rejection Region and Test Statistic for Note 8.48 "Example 13"

[Figure: standard normal curve for $H_a : p \neq 0.5146$.]
EXAMPLE 14


Perform the test of Note 8.47 "Example 12" using the p-value approach.


Solution: 


We  already  know  that  the  sample  size  is  sufficiently  large  to  validly  perform 
the  test. 

• Steps 1-3 of the five-step procedure described in Section 8.3.2 "The "
have already been done in Note 8.47 "Example 12" so we will not repeat
them here, but only say that we know that the test is right-tailed and
that the value of the test statistic is $Z = 1.789$.

• Step 4. Since the test is right-tailed the p-value is the area under the
standard normal curve to the right of the observed test statistic, $Z = 1.789$, as
illustrated in Figure 8.17. By Figure 12.2 "Cumulative Normal
Probability" that area and therefore the p-value is

\[ 1 - 0.9633 = 0.0367. \]

• Step 5. Since the p-value is less than $\alpha = 0.05$ the decision is to reject $H_0$.


Figure 8.17 P-Value for Note 8.49 "Example 14"

[Figure: standard normal curve with the right tail cut off at $Z = 1.789$ shaded; area = 0.0367.]
EXAMPLE 15


Perform  the  test  of  Note  8.48  "Example  13"  using  the  p-value  approach. 
Solution: 

We  already  know  that  the  sample  size  is  sufficiently  large  to  validly  perform 
the  test. 

• Steps 1-3 of the five-step procedure described in Section 8.3.2 "The "
have already been done in Note 8.48 "Example 13". They tell us that the
test is two-tailed and that the value of the test statistic is $Z = 1.542$.

• Step 4. Since the test is two-tailed the p-value is double the area
under the standard normal curve cut off by the observed test statistic,
$Z = 1.542$. By Figure 12.2 "Cumulative Normal Probability" that area is
$1 - 0.9382 = 0.0618$, as illustrated in Figure 8.18, hence the
p-value is $2 \times 0.0618 = 0.1236$.

• Step 5. Since the p-value is greater than $\alpha = 0.10$ the decision is not
to reject $H_0$.


Figure 8.18 P-Value for Note 8.50 "Example 15"

[Figure: standard normal curve for $H_a : p \neq 0.5146$ with the tail cut off by the test statistic shaded; area = 0.0618.]
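The p-value computations in the last two examples can be reproduced with a short Python sketch. The function name `p_value` and its `tail` argument are hypothetical, and the areas come from `statistics.NormalDist` instead of the printed table. Because the table is entered with the test statistic rounded to two decimal places, software values may differ from the table values in the third or fourth decimal place.

```python
from statistics import NormalDist


def p_value(z, tail):
    """Observed significance of a test whose statistic is standard normal.

    tail is "left", "right", or "two", matching the form of H_a.
    """
    cdf = NormalDist().cdf  # cumulative area to the left under the standard normal curve
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1 - cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))  # double the area of the tail cut off by z
    raise ValueError("tail must be 'left', 'right', or 'two'")


print(p_value(1.789, "right"))  # right-tailed test of the soft drink example
print(p_value(1.542, "two"))    # two-tailed test of the newborns example
```

Either value is then compared with $\alpha$ exactly as in Step 5 of the examples: reject $H_0$ when the p-value is less than or equal to $\alpha$.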


KEY  TAKEAWAYS 


•  There  is  one  formula  for  the  test  statistic  in  testing  hypotheses  about  a 
population  proportion.  The  test  statistic  follows  the  standard  normal 
distribution. 

•  Either  five-step  procedure,  critical  value  or  p-value  approach,  can  be 
used. 
On all exercises for this section you may assume that the sample is sufficiently large
for the relevant test to be validly performed.

1. Compute the value of the test statistic for each test using the information
given.

a. Testing $H_0 : p = 0.50$ vs. $H_a : p > 0.50$, $n = 360$, $\hat{p} = 0.56$.

b. Testing $H_0 : p = 0.50$ vs. $H_a : p \neq 0.50$, $n = 360$, $\hat{p} = 0.56$.

c. Testing $H_0 : p = 0.37$ vs. $H_a : p < 0.37$, $n = 1200$, $\hat{p} = 0.35$.

2. Compute the value of the test statistic for each test using the information
given.

a. Testing $H_0 : p = 0.72$ vs. $H_a : p < 0.72$, $n = 2100$, $\hat{p} = 0.71$.

b. Testing $H_0 : p = 0.83$ vs. $H_a : p \neq 0.83$, $n = 500$, $\hat{p} = 0.86$.

c. Testing $H_0 : p = 0.22$ vs. $H_a : p < 0.22$, $n = 750$, $\hat{p} = 0.18$.

3. For each part of Exercise 1 construct the rejection region for the test for
$\alpha = 0.05$ and make the decision based on your answer to that part of the
exercise.

4. For each part of Exercise 2 construct the rejection region for the test for
$\alpha = 0.05$ and make the decision based on your answer to that part of the
exercise.

5. For each part of Exercise 1 compute the observed significance (p-value) of the
test and compare it to $\alpha = 0.05$ in order to make the decision by the p-value
approach to hypothesis testing.

6. For each part of Exercise 2 compute the observed significance (p-value) of the
test and compare it to $\alpha = 0.05$ in order to make the decision by the p-value
approach to hypothesis testing.

7. Perform the indicated test of hypotheses using the critical value approach.

a. Testing $H_0 : p = 0.55$ vs. $H_a : p > 0.55$ @ $\alpha = 0.05$, $n = 300$, $\hat{p} = 0.60$.

b. Testing $H_0 : p = 0.47$ vs. $H_a : p \neq 0.47$ @ $\alpha = 0.01$, $n = 9750$, $\hat{p} = 0.46$.

8. Perform the indicated test of hypotheses using the critical value approach.

a. Testing $H_0 : p = 0.15$ vs. $H_a : p \neq 0.15$ @ $\alpha = 0.001$, $n = 1600$, $\hat{p} = 0.18$.

b. Testing $H_0 : p = 0.90$ vs. $H_a : p > 0.90$ @ $\alpha = 0.01$, $n = 1100$, $\hat{p} = 0.91$.

9. Perform the indicated test of hypotheses using the p-value approach.

a. Testing $H_0 : p = 0.37$ vs. $H_a : p \neq 0.37$ @ $\alpha = 0.005$, $n = 1300$, $\hat{p} = 0.40$.

b. Testing $H_0 : p = 0.94$ vs. $H_a : p > 0.94$ @ $\alpha = 0.05$, $n = 1200$, $\hat{p} = 0.96$.

10. Perform the indicated test of hypotheses using the p-value approach.

a. Testing $H_0 : p = 0.25$ vs. $H_a : p < 0.25$ @ $\alpha = 0.10$, $n = 850$, $\hat{p} = 0.23$.

b. Testing $H_0 : p = 0.33$ vs. $H_a : p \neq 0.33$ @ $\alpha = 0.05$, $n = 1100$, $\hat{p} = 0.30$.


APPLICATIONS 


11.  Five  years  ago  3.9%  of  children  in  a  certain  region  lived  with  someone  other 
than  a  parent.  A  sociologist  wishes  to  test  whether  the  current  proportion  is 
different.  Perform  the  relevant  test  at  the  5%  level  of  significance  using  the 
following  data:  in  a  random  sample  of  2,759  children,  119  lived  with  someone 
other  than  a  parent. 

12.  The  government  of  a  particular  country  reports  its  literacy  rate  as  52%.  A 
nongovernmental  organization  believes  it  to  be  less.  The  organization  takes  a 
random  sample  of  600  inhabitants  and  obtains  a  literacy  rate  of  42%.  Perform 
the  relevant  test  at  the  0.5%  (one-half  of  1%)  level  of  significance. 

13. Two years ago 72% of households in a certain county regularly participated in
recycling  household  waste.  The  county  government  wishes  to  investigate 
whether  that  proportion  has  increased  after  an  intensive  campaign  promoting 
recycling.  In  a  survey  of  900  households,  674  regularly  participate  in  recycling. 
Perform  the  relevant  test  at  the  10%  level  of  significance. 

14.  Prior  to  a  special  advertising  campaign,  23%  of  all  adults  recognized  a 
particular  company’s  logo.  At  the  close  of  the  campaign  the  marketing 
department  commissioned  a  survey  in  which  311  of  1,200  randomly  selected 
adults  recognized  the  logo.  Determine,  at  the  1%  level  of  significance,  whether 
the  data  provide  sufficient  evidence  to  conclude  that  more  than  23%  of  all 
adults  now  recognize  the  company’s  logo. 
15. A report five years ago stated that 35.5% of all state-owned bridges in a
particular  state  were  “deficient.”  An  advocacy  group  took  a  random  sample  of 
100  state-owned  bridges  in  the  state  and  found  33  to  be  currently  rated  as 
being  “deficient.”  Test  whether  the  current  proportion  of  bridges  in  such 
condition  is  35.5%  versus  the  alternative  that  it  is  different  from  35.5%,  at  the 
10%  level  of  significance. 

16.  In  the  previous  year  the  proportion  of  deposits  in  checking  accounts  at  a 
certain  bank  that  were  made  electronically  was  45%.  The  bank  wishes  to 
determine  if  the  proportion  is  higher  this  year.  It  examined  20,000  deposit 
records  and  found  that  9,217  were  electronic.  Determine,  at  the  1%  level  of 
significance,  whether  the  data  provide  sufficient  evidence  to  conclude  that 
more  than  45%  of  all  deposits  to  checking  accounts  are  now  being  made 
electronically. 

17.  According  to  the  Federal  Poverty  Measure  12%  of  the  U.S.  population  lives  in 
poverty.  The  governor  of  a  certain  state  believes  that  the  proportion  there  is 
lower.  In  a  sample  of  size  1,550, 163  were  impoverished  according  to  the 
federal  measure. 

a.  Test  whether  the  true  proportion  of  the  state’s  population  that  is 
impoverished  is  less  than  12%,  at  the  5%  level  of  significance. 

b.  Compute  the  observed  significance  of  the  test. 

18.  An  insurance  company  states  that  it  settles  85%  of  all  life  insurance  claims 
within  30  days.  A  consumer  group  asks  the  state  insurance  commission  to 
investigate.  In  a  sample  of  250  life  insurance  claims,  203  were  settled  within  30 
days. 

a.  Test  whether  the  true  proportion  of  all  life  insurance  claims  made  to  this 
company  that  are  settled  within  30  days  is  less  than  85%,  at  the  5%  level  of 
significance. 

b.  Compute  the  observed  significance  of  the  test. 

19.  A  special  interest  group  asserts  that  90%  of  all  smokers  began  smoking  before 
age  18.  In  a  sample  of  850  smokers,  687  began  smoking  before  age  18. 

a.  Test  whether  the  true  proportion  of  all  smokers  who  began  smoking 
before  age  18  is  less  than  90%,  at  the  1%  level  of  significance. 

b.  Compute  the  observed  significance  of  the  test. 

20.  In  the  past,  68%  of  a  garage’s  business  was  with  former  patrons.  The  owner  of 
the  garage  samples  200  repair  invoices  and  finds  that  for  only  114  of  them  the 
patron  was  a  repeat  customer. 
a. Test whether the true proportion of all current business that is with repeat
customers  is  less  than  68%,  at  the  1%  level  of  significance. 

b.  Compute  the  observed  significance  of  the  test. 


ADDITIONAL  EXERCISES 


21.  A  rule  of  thumb  is  that  for  working  individuals  one-quarter  of  household 
income  should  be  spent  on  housing.  A  financial  advisor  believes  that  the 
average  proportion  of  income  spent  on  housing  is  more  than  0.25.  In  a  sample 
of  30  households,  the  mean  proportion  of  household  income  spent  on  housing 
was  0.285  with  a  standard  deviation  of  0.063.  Perform  the  relevant  test  of 
hypotheses  at  the  1%  level  of  significance.  Hint:  This  exercise  could  have  been 
presented  in  an  earlier  section. 

22.  Ice  cream  is  legally  required  to  contain  at  least  10%  milk  fat  by  weight.  The 
manufacturer  of  an  economy  ice  cream  wishes  to  be  close  to  the  legal  limit, 
hence  produces  its  ice  cream  with  a  target  proportion  of  0.106  milk  fat.  A 
sample  of  five  containers  yielded  a  mean  proportion  of  0.094  milk  fat  with 
standard  deviation  0.002.  Test  the  null  hypothesis  that  the  mean  proportion  of 
milk  fat  in  all  containers  is  0.106  against  the  alternative  that  it  is  less  than 
0.106,  at  the  10%  level  of  significance.  Assume  that  the  proportion  of  milk  fat  in 
containers  is  normally  distributed.  Hint:  This  exercise  could  have  been 
presented  in  an  earlier  section. 


LARGE  DATA  SET  EXERCISES 


23.  Large  Data  Sets  4  and  4A  list  the  results  of  500  tosses  of  a  die.  Let  p  denote  the 
proportion  of  all  tosses  of  this  die  that  would  result  in  a  five.  Use  the  sample 
data to test the hypothesis that $p$ is different from 1/6, at the 20% level of
significance. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data4.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data4A.xls

24.  Large  Data  Set  6  records  results  of  a  random  survey  of  200  voters  in  each  of  two 
regions,  in  which  they  were  asked  to  express  whether  they  prefer  Candidate  A 
for  a  U.S.  Senate  seat  or  prefer  some  other  candidate.  Use  the  full  data  set  (400 
observations)  to  test  the  hypothesis  that  the  proportion  p  of  all  voters  who 
prefer  Candidate  A  exceeds  0.35.  Test  at  the  10%  level  of  significance. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data6.xls

25. Lines 2 through 536 in Large Data Set 11 are a sample of 535 real estate sales in a
certain region in 2008. Those that were foreclosure sales are identified with a 1
in the second column. Use these data to test, at the 10% level of significance,
the hypothesis that the proportion $p$ of all real estate sales in this region in
2008 that were foreclosure sales was less than 25%. (The null hypothesis is
$H_0 : p = 0.25$.)

http://www.gone.2012books.lardbucket.org/sites/all/files/data11.xls

26. Lines 537 through 1106 in Large Data Set 11 are a sample of 570 real estate sales
in a certain region in 2010. Those that were foreclosure sales are identified
with a 1 in the second column. Use these data to test, at the 5% level of
significance, the hypothesis that the proportion $p$ of all real estate sales in this
region in 2010 that were foreclosure sales was greater than 23%. (The null
hypothesis is $H_0 : p = 0.23$.)

http://www.gone.2012books.lardbucket.org/sites/all/files/data11.xls
ANSWERS


1. a. $Z = 2.277$

b. $Z = 2.277$

c. $Z = -1.435$

3. a. $Z \ge 1.645$; reject $H_0$.

b. $Z \le -1.96$ or $Z \ge 1.96$; reject $H_0$.

c. $Z \le -1.645$; do not reject $H_0$.

5. a. p-value $= 0.0116$, $\alpha = 0.05$; reject $H_0$.

b. p-value $= 0.0232$, $\alpha = 0.05$; reject $H_0$.

c. p-value $= 0.0749$, $\alpha = 0.05$; do not reject $H_0$.

7. a. $Z = 1.74$, $z_{0.05} = 1.645$, reject $H_0$.

b. $Z = -1.98$, $-z_{0.005} = -2.576$, do not reject $H_0$.

9. a. $Z = 2.24$, p-value $= 0.025$, $\alpha = 0.005$, do not reject $H_0$.
b. $Z = 2.92$, p-value $= 0.0018$, $\alpha = 0.05$, reject $H_0$.

11. $Z = 1.11$, $z_{0.025} = 1.96$, do not reject $H_0$.

13. $Z = 1.93$, $z_{0.10} = 1.28$, reject $H_0$.

15. $Z = -0.523$, $\pm z_{0.05} = \pm 1.645$, do not reject $H_0$.

17. a. $Z = -1.798$, $-z_{0.05} = -1.645$, reject $H_0$;
b. p-value $= 0.0359$.

19. a. $Z = -8.92$, $-z_{0.01} = -2.33$, reject $H_0$;
b. p-value $\approx 0$.

21. $Z = 3.04$, $z_{0.01} = 2.33$, reject $H_0$.

23. $H_0 : p = 1/6$ vs. $H_a : p \neq 1/6$. Test Statistic: $Z = -0.76$.
Rejection Region: $(-\infty, -1.28] \cup [1.28, \infty)$. Decision: Fail to reject $H_0$.

25. $H_0 : p = 0.25$ vs. $H_a : p < 0.25$. Test Statistic: $Z = -1.17$. Rejection
Region: $(-\infty, -1.28]$. Decision: Fail to reject $H_0$.
Chapter 9 Two-Sample Problems


The  previous  two  chapters  treated  the  questions  of  estimating  and  making 
inferences  about  a  parameter  of  a  single  population.  In  this  chapter  we  consider  a 
comparison  of  parameters  that  belong  to  two  different  populations.  For  example, 
we  might  wish  to  compare  the  average  income  of  all  adults  in  one  region  of  the 
country  with  the  average  income  of  those  in  another  region,  or  we  might  wish  to 
compare  the  proportion  of  all  men  who  are  vegetarians  with  the  proportion  of  all 
women  who  are  vegetarians. 


We  will  study  construction  of  confidence  intervals  and  tests  of  hypotheses  in  four 
situations,  depending  on  the  parameter  of  interest,  the  sizes  of  the  samples  drawn 
from  each  of  the  populations,  and  the  method  of  sampling.  We  also  examine  sample 
size  considerations. 
9.1 Comparison of Two Population Means: Large, Independent Samples


LEARNING  OBJECTIVES 

1.  To  understand  the  logical  framework  for  estimating  the  difference 
between  the  means  of  two  distinct  populations  and  performing  tests  of 
hypotheses  concerning  those  means. 

2.  To  learn  how  to  construct  a  confidence  interval  for  the  difference  in  the 
means  of  two  distinct  populations  using  large,  independent  samples. 

3.  To  learn  how  to  perform  a  test  of  hypotheses  concerning  the  difference 
between  the  means  of  two  distinct  populations  using  large,  independent 
samples. 


Suppose  we  wish  to  compare  the  means  of  two  distinct  populations.  Figure  9.1 
"Independent  Sampling  from  Two  Populations"  illustrates  the  conceptual 
framework  of  our  investigation  in  this  and  the  next  section.  Each  population  has  a 
mean  and  a  standard  deviation.  We  arbitrarily  label  one  population  as  Population  1 
and  the  other  as  Population  2,  and  subscript  the  parameters  with  the  numbers  1 
and  2  to  tell  them  apart.  We  draw  a  random  sample  from  Population  1  and  label  the 
sample  statistics  it  yields  with  the  subscript  1.  Without  reference  to  the  first  sample 
we  draw  a  sample  from  Population  2  and  label  its  sample  statistics  with  the 
subscript  2. 
Figure 9.1 Independent Sampling from Two Populations

[Figure: a sample is drawn from each of two populations.]

Population 1: mean $\mu_1$, s.d. $\sigma_1$. Sample 1: size $n_1$, mean $\bar{x}_1$, s.d. $s_1$.
Population 2: mean $\mu_2$, s.d. $\sigma_2$. Sample 2: size $n_2$, mean $\bar{x}_2$, s.d. $s_2$.

Definition 

Samples  from  two  distinct  populations  are  independent  if  each  one  is  drawn  without 
reference  to  the  other,  and  has  no  connection  with  the  other. 


Our goal is to use the information in the samples to estimate the difference $\mu_1 - \mu_2$
in the means of the two populations and to make statistically valid inferences about
it.

Confidence  Intervals 

Since the mean $\bar{x}_1$ of the sample drawn from Population 1 is a good estimator of $\mu_1$
and the mean $\bar{x}_2$ of the sample drawn from Population 2 is a good estimator of $\mu_2$, a
reasonable point estimate of the difference $\mu_1 - \mu_2$ is $\bar{x}_1 - \bar{x}_2$. In order to widen
this point estimate into a confidence interval, we first suppose that both samples
are large, that is, that both $n_1 \ge 30$ and $n_2 \ge 30$. If so, then the following formula
for a confidence interval for $\mu_1 - \mu_2$ is valid. The symbols $s_1^2$ and $s_2^2$ denote the
squares of $s_1$ and $s_2$. (In the relatively rare case that both population standard
deviations $\sigma_1$ and $\sigma_2$ are known they would be used instead of the sample standard
deviations.)


$100(1 - \alpha)\%$ Confidence Interval for the Difference
Between Two Population Means: Large, Independent
Samples

\[ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

The samples must be independent, and each sample must be large: $n_1 \ge 30$ and
$n_2 \ge 30$.
EXAMPLE 1


To  compare  customer  satisfaction  levels  of  two  competing  cable  television 
companies,  174  customers  of  Company  1  and  355  customers  of  Company  2 
were  randomly  selected  and  were  asked  to  rate  their  cable  companies  on  a 
five-point  scale,  with  1  being  least  satisfied  and  5  most  satisfied.  The  survey 
results  are  summarized  in  the  following  table: 


Company 1              Company 2
$n_1 = 174$            $n_2 = 355$
$\bar{x}_1 = 3.51$     $\bar{x}_2 = 3.24$
$s_1 = 0.51$           $s_2 = 0.52$

Construct a point estimate and a 99% confidence interval for $\mu_1 - \mu_2$, the
difference in average satisfaction levels of customers of the two companies
as measured on this five-point scale.

Solution:

The point estimate of $\mu_1 - \mu_2$ is

\[ \bar{x}_1 - \bar{x}_2 = 3.51 - 3.24 = 0.27. \]

In words, we estimate that the average customer satisfaction level for
Company 1 is 0.27 points higher on this five-point scale than it is for
Company 2.

To apply the formula for the confidence interval, proceed exactly as was
done in Chapter 7 "Estimation". The 99% confidence level means that
$\alpha = 1 - 0.99 = 0.01$ so that $z_{\alpha/2} = z_{0.005}$. From Figure 12.3
"Critical Values of " we read directly that $z_{0.005} = 2.576$. Thus

\[ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = 0.27 \pm 2.576 \sqrt{\frac{0.51^2}{174} + \frac{0.52^2}{355}} = 0.27 \pm 0.12 \]

We are 99% confident that the difference in the population means lies in the
interval $[0.15, 0.39]$, in the sense that in repeated sampling 99% of all
intervals constructed from the sample data in this manner will contain
$\mu_1 - \mu_2$. In the context of the problem we say we are 99% confident that
the average level of customer satisfaction for Company 1 is between 0.15 and
0.39 points higher, on this five-point scale, than that for Company 2.
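The interval in this example can be checked with a short Python sketch. It is an illustration only: the function name `two_mean_ci` is hypothetical, and the critical value $z_{\alpha/2}$ is taken from `statistics.NormalDist` rather than read from the printed table.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_ci(x1, s1, n1, x2, s2, n2, confidence):
    """Large-sample confidence interval for mu_1 - mu_2 (independent samples)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)         # z_{alpha/2}
    margin = z * sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)  # z_{alpha/2} times the standard error
    diff = x1 - x2                                  # point estimate of mu_1 - mu_2
    return diff - margin, diff + margin


# Cable company data: Company 1 (n = 174) and Company 2 (n = 355), 99% confidence.
low, high = two_mean_ci(3.51, 0.51, 174, 3.24, 0.52, 355, 0.99)
print(round(low, 2), round(high, 2))
```

Rounded to two decimal places the endpoints agree with the interval $[0.15, 0.39]$ obtained above.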


Hypothesis  Testing 

Hypotheses  concerning  the  relative  sizes  of  the  means  of  two  populations  are  tested 
using  the  same  critical  value  and  p-value  procedures  that  were  used  in  the  case  of  a 
single  population.  All  that  is  needed  is  to  know  how  to  express  the  null  and 
alternative  hypotheses  and  to  know  the  formula  for  the  standardized  test  statistic 
and  the  distribution  that  it  follows. 


The  null  and  alternative  hypotheses  will  always  be  expressed  in  terms  of  the 
difference  of  the  two  population  means.  Thus  the  null  hypothesis  will  always  be 
written 


\[ H_0 : \mu_1 - \mu_2 = D_0 \]

where $D_0$ is a number that is deduced from the statement of the situation. As was
the case with a single population the alternative hypothesis can take one of the
three forms, with the same terminology:


Form of $H_a$                        Terminology
$H_a : \mu_1 - \mu_2 < D_0$          Left-tailed
$H_a : \mu_1 - \mu_2 > D_0$          Right-tailed
$H_a : \mu_1 - \mu_2 \neq D_0$       Two-tailed

As long as the samples are independent and both are large the following formula for
the standardized test statistic is valid, and it has the standard normal distribution.
(In the relatively rare case that both population standard deviations $\sigma_1$ and $\sigma_2$ are
known they would be used instead of the sample standard deviations.)
Standardized Test Statistic for Hypothesis Tests
Concerning the Difference Between Two Population
Means: Large, Independent Samples

\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]

The test statistic has the standard normal distribution.

The samples must be independent, and each sample must be large: $n_1 \ge 30$ and
$n_2 \ge 30$.
EXAMPLE 2


Refer to Note 9.4 "Example 1" concerning the mean satisfaction levels of
customers  of  two  competing  cable  television  companies.  Test  at  the  1%  level 
of  significance  whether  the  data  provide  sufficient  evidence  to  conclude  that 
Company  1  has  a  higher  mean  satisfaction  rating  than  does  Company  2.  Use 
the  critical  value  approach. 


Solution: 


• Step 1. If the mean satisfaction levels $\mu_1$ and $\mu_2$ are the same
then $\mu_1 = \mu_2$, but we always express the null hypothesis in
terms of the difference between $\mu_1$ and $\mu_2$, hence $H_0$ is
$\mu_1 - \mu_2 = 0$. To say that the mean customer satisfaction for
Company 1 is higher than that for Company 2 means that
$\mu_1 > \mu_2$, which in terms of their difference is $\mu_1 - \mu_2 > 0$.
The test is therefore

\[ H_0 : \mu_1 - \mu_2 = 0 \quad \text{vs.} \quad H_a : \mu_1 - \mu_2 > 0 \quad @\ \alpha = 0.01 \]

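• Step 2. Since the samples are independent and both are large the
test statistic is

\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]

• Step 3. Inserting the data into the formula for the test statistic
gives

\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(3.51 - 3.24) - 0}{\sqrt{\dfrac{0.51^2}{174} + \dfrac{0.52^2}{355}}} = 5.684 \]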




• Step 4. Since the symbol in $H_a$ is “>” this is a right-tailed test, so
there is a single critical value, $z_\alpha = z_{0.01}$, which from the last
line in Figure 12.3 "Critical Values of $t$" we read off as 2.326. The
rejection region is $[2.326, \infty)$.


Figure  9.2 

Rejection  Region  and  Test  Statistic  for  Note  9.6  "Example  2" 

$H_a : \mu_1 - \mu_2 > 0$


• Step 5. As shown in Figure 9.2 "Rejection Region and Test
Statistic for Note 9.6 'Example 2'" the test statistic falls in the
rejection region. The decision is to reject $H_0$. In the context of the
problem our conclusion is:

The  data  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  the  mean  customer  satisfaction  for 
Company  1  is  higher  than  that  for  Company  2. 
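
The five steps above can be checked numerically. The following Python sketch is illustrative only: the function name `two_sample_z` is hypothetical, and the sample sizes and standard deviations ($n_1 = 174$, $s_1 = 0.51$, $n_2 = 355$, $s_2 = 0.52$) are assumed to be those of Example 1, which is stated outside this passage. With those assumed values the computation reproduces the test statistic 5.684 and the critical value 2.326.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x1, s1, n1, x2, s2, n2, d0=0.0):
    """Standardized test statistic for large, independent samples."""
    standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    return ((x1 - x2) - d0) / standard_error


# The sample means 3.51 and 3.24 appear in Example 2; the sample sizes and
# standard deviations are ASSUMED to be those of Example 1.
z = two_sample_z(3.51, 0.51, 174, 3.24, 0.52, 355)

alpha = 0.01
# Right-tailed test: a single critical value z_alpha.
critical_value = NormalDist().inv_cdf(1 - alpha)

reject_h0 = z >= critical_value
print(f"Z = {z:.3f}, critical value = {critical_value:.3f}, reject H0: {reject_h0}")
```

Computing the critical value with `NormalDist().inv_cdf` replaces the table lookup in Step 4; a left-tailed or two-tailed test would use `alpha` or `alpha / 2` in the corresponding tail instead.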
EXAMPLE 3


Perform  the  test  of  Note  9.6  "Example  2"  using  the  p-value  approach. 

Solution: 

The  first  three  steps  are  identical  to  those  in  Note  9.6  "Example  2". 

• Step 4. The observed significance or p-value of the test is the area of the
right tail of the standard normal distribution that is cut off by the test
statistic Z = 5.684. The number 5.684 is too large to appear in Figure 12.2
"Cumulative Normal Probability", which means that the area of the left
tail that it cuts off is 1.0000 to four decimal places. The area that we
seek, the area of the right tail, is therefore $1 - 1.0000 = 0.0000$ to
four decimal places. See Figure 9.3. That is, p-value = 0.0000 to
four decimal places. (The actual value is approximately 0.000000007.)


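Figure 9.3

P-Value for Note 9.7 "Example 3"

[Figure: standard normal curve with the test statistic Z = 5.684 marked. The
area to its left is approximately 1.0000; the right-tail area, which is the
p-value, is approximately 0.0000.]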


• Step 5. Since 0.0000 < 0.01, p-value < $\alpha$ so the decision is to
reject the null hypothesis:

The  data  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  the  mean  customer  satisfaction  for 
Company  1  is  higher  than  that  for  Company  2. 
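
Because the table cannot resolve a tail area this small, it may help to see the p-value computed directly. This is a minimal sketch using only the Python standard library; the helper name `right_tail_area` is hypothetical. It uses the complementary error function so that the tiny tail area is not lost to rounding.

```python
from math import erfc, sqrt


def right_tail_area(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * erfc(z / sqrt(2))


z = 5.684      # test statistic from Example 2
alpha = 0.01   # level of significance

p_value = right_tail_area(z)
print(f"p-value to four decimal places: {p_value:.4f}")
print(f"p-value unrounded: {p_value:.1e}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

The unrounded result is on the order of $10^{-9}$, in line with the approximate value quoted in Step 4.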
KEY TAKEAWAYS


•  A  point  estimate  for  the  difference  in  two  population  means  is  simply 
the  difference  in  the  corresponding  sample  means. 

•  In  the  context  of  estimating  or  testing  hypotheses  concerning  two 
population  means,  “large”  samples  means  that  both  samples  are  large. 

•  A  confidence  interval  for  the  difference  in  two  population  means  is 
computed  using  a  formula  in  the  same  fashion  as  was  done  for  a  single 
population  mean. 

•  The  same  five-step  procedure  used  to  test  hypotheses  concerning  a 
single  population  mean  is  used  to  test  hypotheses  concerning  the 
difference  between  two  population  means.  The  only  difference  is  in  the 
formula  for  the  standardized  test  statistic. 
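
As a sketch of the confidence-interval takeaway, the code below builds a large-sample interval for $\mu_1 - \mu_2$ as the point estimate plus or minus $z_{\alpha/2}$ times the same standard error that appears in the test statistic. The function name `large_sample_ci` is hypothetical, and the inputs are the summary statistics of Exercise 1(a) in the exercise set for this section.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_ci(x1, s1, n1, x2, s2, n2, confidence):
    """Confidence interval for mu_1 - mu_2 from large, independent samples."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    margin = z_crit * sqrt(s1**2 / n1 + s2**2 / n2)
    point_estimate = x1 - x2                       # difference of sample means
    return point_estimate - margin, point_estimate + margin


# Exercise 1(a): 90% confidence, n1 = 45, xbar1 = 27, s1 = 2,
#                                n2 = 60, xbar2 = 22, s2 = 3.
low, high = large_sample_ci(27, 2, 45, 22, 3, 60, 0.90)
print(f"({low:.2f}, {high:.2f})")
```

Rounded to two decimal places the endpoints agree with the interval listed for Exercise 1(a) in the answers.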
EXERCISES

1. Construct the confidence interval for $\mu_1 - \mu_2$ for the level of confidence and
the data from independent samples given.

a. 90% confidence,

$n_1 = 45$, $\bar{x}_1 = 27$, $s_1 = 2$
$n_2 = 60$, $\bar{x}_2 = 22$, $s_2 = 3$

b. 99% confidence,

$n_1 = 30$, $\bar{x}_1 = -112$, $s_1 = 9$
$n_2 = 40$, $\bar{x}_2 = -98$, $s_2 = 4$

2. Construct the confidence interval for $\mu_1 - \mu_2$ for the level of confidence and
the data from independent samples given.

a. 95% confidence,

$n_1 = 110$, $\bar{x}_1 = 11$, $s_1 = 15$
$n_2 = 85$, $\bar{x}_2 = 19$, $s_2 = 21$

b. 90% confidence,

$n_1 = 65$, $\bar{x}_1 = -83$, $s_1 = 12$
$n_2 = 65$, $\bar{x}_2 = -74$, $s_2 = 8$

3. Construct the confidence interval for $\mu_1 - \mu_2$ for the level of confidence and
the data from independent samples given.

a. 99.5% confidence,

$n_1 = 130$, $\bar{x}_1 = 27.2$, $s_1 = 2.5$
$n_2 = 155$, $\bar{x}_2 = 38.8$, $s_2 = 4.6$

b. 95% confidence,

$n_1 = 68$, $\bar{x}_1 = 215.5$, $s_1 = 12.3$
$n_2 = 84$, $\bar{x}_2 = 287.8$, $s_2 = 14.1$

4. Construct the confidence interval for $\mu_1 - \mu_2$ for the level of confidence and
the data from independent samples given.

a. 99.9% confidence,

$n_1 = 275$, $\bar{x}_1 = 70.2$, $s_1 = 1.5$
$n_2 = 325$, $\bar{x}_2 = 63.4$, $s_2 = 1.1$

b. 90% confidence,

$n_1 = 120$, $\bar{x}_1 = 35.5$, $s_1 = 0.75$
$n_2 = 146$, $\bar{x}_2 = 29.6$, $s_2 = 0.80$

5. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the critical value approach. Compute the p-value of the test
as well.

a. Test $H_0 : \mu_1 - \mu_2 = 3$ vs. $H_a : \mu_1 - \mu_2 \neq 3$ @ $\alpha = 0.05$,

$n_1 = 35$, $\bar{x}_1 = 25$, $s_1 = 1$
$n_2 = 45$, $\bar{x}_2 = 19$, $s_2 = 2$

b. Test $H_0 : \mu_1 - \mu_2 = -25$ vs. $H_a : \mu_1 - \mu_2 < -25$ @ $\alpha = 0.10$,

$n_1 = 85$, $\bar{x}_1 = 188$, $s_1 = 15$
$n_2 = 62$, $\bar{x}_2 = 215$, $s_2 = 19$

6. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the critical value approach. Compute the p-value of the test
as well.

a. Test $H_0 : \mu_1 - \mu_2 = 45$ vs. $H_a : \mu_1 - \mu_2 > 45$ @ $\alpha = 0.001$,

$n_1 = 200$, $\bar{x}_1 = 1312$, $s_1 = 35$
$n_2 = 225$, $\bar{x}_2 = 1256$, $s_2 = 28$

b. Test $H_0 : \mu_1 - \mu_2 = -12$ vs. $H_a : \mu_1 - \mu_2 \neq -12$ @ $\alpha = 0.10$,

$n_1 = 35$, $\bar{x}_1 = 121$, $s_1 = 6$
$n_2 = 40$, $\bar{x}_2 = 135$, $s_2 = 7$




7. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the critical value approach. Compute the p-value of the test
as well.

a. Test $H_0 : \mu_1 - \mu_2 = 0$ vs. $H_a : \mu_1 - \mu_2 \neq 0$ @ $\alpha = 0.01$,

$n_1 = 125$, $\bar{x}_1 = -46$, $s_1 = 10$
$n_2 = 90$, $\bar{x}_2 = -50$, $s_2 = 13$

b. Test $H_0 : \mu_1 - \mu_2 = 20$ vs. $H_a : \mu_1 - \mu_2 > 20$ @ $\alpha = 0.05$,

$n_1 = 40$, $\bar{x}_1 = 142$, $s_1 = 11$
$n_2 = 40$, $\bar{x}_2 = 118$, $s_2 = 10$

8. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the critical value approach. Compute the p-value of the test
as well.

a. Test $H_0 : \mu_1 - \mu_2 = 13$ vs. $H_a : \mu_1 - \mu_2 < 13$ @ $\alpha = 0.01$,

$n_1 = 35$, $\bar{x}_1 = 100$, $s_1 = 2$
$n_2 = 35$, $\bar{x}_2 = 88$, $s_2 = 2$

b. Test $H_0 : \mu_1 - \mu_2 = -10$ vs. $H_a : \mu_1 - \mu_2 \neq -10$ @ $\alpha = 0.10$,

$n_1 = 146$, $\bar{x}_1 = 62$, $s_1 = 4$
$n_2 = 120$, $\bar{x}_2 = 73$, $s_2 = 7$

9. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the p-value approach.

a. Test $H_0 : \mu_1 - \mu_2 = 57$ vs. $H_a : \mu_1 - \mu_2 < 57$ @ $\alpha = 0.10$,

$n_1 = 117$, $\bar{x}_1 = 1309$, $s_1 = 42$
$n_2 = 133$, $\bar{x}_2 = 1258$, $s_2 = 37$

b. Test $H_0 : \mu_1 - \mu_2 = -1.5$ vs. $H_a : \mu_1 - \mu_2 \neq -1.5$ @ $\alpha = 0.20$,

$n_1 = 65$, $\bar{x}_1 = 16.9$, $s_1 = 1.3$
$n_2 = 57$, $\bar{x}_2 = 18.6$, $s_2 = 1.1$

10. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the p-value approach.

a. Test $H_0 : \mu_1 - \mu_2 = -10.5$ vs. $H_a : \mu_1 - \mu_2 > -10.5$ @ $\alpha = 0.01$,

$n_1 = 64$, $\bar{x}_1 = 85.6$, $s_1 = 2.4$
$n_2 = 50$, $\bar{x}_2 = 95.3$, $s_2 = 3.1$

b. Test $H_0 : \mu_1 - \mu_2 = 110$ vs. $H_a : \mu_1 - \mu_2 \neq 110$ @ $\alpha = 0.02$,

$n_1 = 176$, $\bar{x}_1 = 1918$, $s_1 = 68$
$n_2 = 241$, $\bar{x}_2 = 1782$, $s_2 = 146$

11. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the p-value approach.

a. Test $H_0 : \mu_1 - \mu_2 = 50$ vs. $H_a : \mu_1 - \mu_2 > 50$ @ $\alpha = 0.005$,

$n_1 = 72$, $\bar{x}_1 = 272$, $s_1 = 26$
$n_2 = 103$, $\bar{x}_2 = 213$, $s_2 = 14$

b. Test $H_0 : \mu_1 - \mu_2 = 7.5$ vs. $H_a : \mu_1 - \mu_2 \neq 7.5$ @ $\alpha = 0.10$,

$n_1 = 52$, $\bar{x}_1 = 94.3$, $s_1 = 2.6$
$n_2 = 38$, $\bar{x}_2 = 88.6$, $s_2 = 8.0$

12. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the p-value approach.

a. Test $H_0 : \mu_1 - \mu_2 = 23$ vs. $H_a : \mu_1 - \mu_2 < 23$ @ $\alpha = 0.20$,

$n_1 = 314$, $\bar{x}_1 = 198$, $s_1 = 12.2$
$n_2 = 220$, $\bar{x}_2 = 176$, $s_2 = 11.5$

b. Test $H_0 : \mu_1 - \mu_2 = 4.4$ vs. $H_a : \mu_1 - \mu_2 \neq 4.4$ @ $\alpha = 0.05$,

$n_1 = 32$, $\bar{x}_1 = 40.3$, $s_1 = 0.5$
$n_2 = 30$, $\bar{x}_2 = 35.5$, $s_2 = 0.7$
APPLICATIONS


13. In order to investigate the relationship between mean job tenure in years
among workers who have a bachelor’s degree or higher and those who do not,
random samples of each type of worker were taken, with the following results.

                               $n$    $\bar{x}$    $s$
Bachelor’s degree or higher    155    5.2          1.3
No degree                      210    5.0          1.5

a.  Construct  the  99%  confidence  interval  for  the  difference  in  the  population 
means  based  on  these  data. 

b.  Test,  at  the  1%  level  of  significance,  the  claim  that  mean  job  tenure  among 
those  with  higher  education  is  greater  than  among  those  without,  against 
the  default  that  there  is  no  difference  in  the  means. 

c.  Compute  the  observed  significance  of  the  test. 

14.  Records  of  40  used  passenger  cars  and  40  used  pickup  trucks  (none  used 
commercially)  were  randomly  selected  to  investigate  whether  there  was  any 
difference  in  the  mean  time  in  years  that  they  were  kept  by  the  original  owner 
before  being  sold.  For  cars  the  mean  was  5.3  years  with  standard  deviation  2.2 
years.  For  pickup  trucks  the  mean  was  7.1  years  with  standard  deviation  3.0 
years. 

a.  Construct  the  95%  confidence  interval  for  the  difference  in  the  means 
based  on  these  data. 

b.  Test  the  hypothesis  that  there  is  a  difference  in  the  means  against  the  null 
hypothesis  that  there  is  no  difference.  Use  the  1%  level  of  significance. 

c.  Compute  the  observed  significance  of  the  test  in  part  (b). 

15. In previous years the average number of patients per hour at a hospital
emergency room on weekends exceeded the average on weekdays by 6.3 visits
per hour. A hospital administrator believes that the current weekend mean
exceeds the weekday mean by fewer than 6.3 visits per hour.

a. Construct the 99% confidence interval for the difference in the population
means based on the following data, derived from a study in which 30
weekend and 30 weekday one-hour periods were randomly selected and
the number of new patients in each recorded.

            $n$    $\bar{x}$    $s$
Weekends    30     13.8         3.1
Weekdays    30     8.6          2.7

b.  Test  at  the  5%  level  of  significance  whether  the  current  weekend  mean 
exceeds  the  weekday  mean  by  fewer  than  6.3  patients  per  hour. 

c.  Compute  the  observed  significance  of  the  test. 

16.  A  sociologist  surveys  50  randomly  selected  citizens  in  each  of  two  countries  to 
compare  the  mean  number  of  hours  of  volunteer  work  done  by  adults  in  each. 
Among  the  50  inhabitants  of  Lilliput,  the  mean  hours  of  volunteer  work  per 
year  was  52,  with  standard  deviation  11.8.  Among  the  50  inhabitants  of 
Blefuscu,  the  mean  number  of  hours  of  volunteer  work  per  year  was  37,  with 
standard  deviation  7.2. 

a.  Construct  the  99%  confidence  interval  for  the  difference  in  mean  number 
of  hours  volunteered  by  all  residents  of  Lilliput  and  the  mean  number  of 
hours  volunteered  by  all  residents  of  Blefuscu. 

b.  Test,  at  the  1%  level  of  significance,  the  claim  that  the  mean  number  of 
hours  volunteered  by  all  residents  of  Lilliput  is  more  than  ten  hours 
greater  than  the  mean  number  of  hours  volunteered  by  all  residents  of 
Blefuscu. 

c.  Compute  the  observed  significance  of  the  test  in  part  (b). 

17.  A  university  administrator  asserted  that  upperclassmen  spend  more  time 
studying  than  underclassmen. 


a. Test this claim against the default that the average number of hours of
study per week by the two groups is the same, using the following
information based on random samples from each group of students. Test at
the 1% level of significance.

                 $n$    $\bar{x}$    $s$
Upperclassmen    35     15.6         2.9
Underclassmen    35     12.3         4.1

b. Compute the observed significance of the test.


18. A kinesiologist claims that the resting heart rate of men aged 18 to 25 who
exercise regularly is more than five beats per minute less than that of men who
do not exercise regularly. Men in each category were selected at random and
their resting heart rates were measured, with the results shown.

                       $n$    $\bar{x}$    $s$
Regular exercise       40     63           1.0
No regular exercise    30     71           1.2

a.  Perform  the  relevant  test  of  hypotheses  at  the  1%  level  of  significance. 

b.  Compute  the  observed  significance  of  the  test. 

19.  Children  in  two  elementary  school  classrooms  were  given  two  versions  of  the 
same  test,  but  with  the  order  of  questions  arranged  from  easier  to  more 
difficult  in  Version  A  and  in  reverse  order  in  Version  B.  Randomly  selected 
students  from  each  class  were  given  Version  A  and  the  rest  Version  B.  The 
results are shown in the table.

             $n$    $\bar{x}$    $s$
Version A    31     83           4.6
Version B    32     78           4.3

a.  Construct  the  90%  confidence  interval  for  the  difference  in  the  means  of 
the  populations  of  all  children  taking  Version  A  of  such  a  test  and  of  all 
children  taking  Version  B  of  such  a  test. 

b.  Test  at  the  1%  level  of  significance  the  hypothesis  that  the  A  version  of  the 
test  is  easier  than  the  B  version  (even  though  the  questions  are  the  same). 

c.  Compute  the  observed  significance  of  the  test. 

20. The Municipal Transit Authority wants to know if, on weekdays, more
passengers ride the northbound blue line train towards the city center that
departs at 8:15 a.m. or the one that departs at 8:30 a.m. The following sample
statistics are assembled by the Transit Authority.

                   $n$    $\bar{x}$    $s$
8:15 a.m. train    30     323          41
8:30 a.m. train    45     356          45

a.  Construct  the  90%  confidence  interval  for  the  difference  in  the  mean 
number  of  daily  travellers  on  the  8:15  train  and  the  mean  number  of  daily 
travellers  on  the  8:30  train. 

b.  Test  at  the  5%  level  of  significance  whether  the  data  provide  sufficient 
evidence  to  conclude  that  more  passengers  ride  the  8:30  train. 

c.  Compute  the  observed  significance  of  the  test. 

21.  In  comparing  the  academic  performance  of  college  students  who  are  affiliated 
with  fraternities  and  those  male  students  who  are  unaffiliated,  a  random 
sample  of  students  was  drawn  from  each  of  the  two  populations  on  a  university 
campus. Summary statistics on the student GPAs are given below.

                $n$    $\bar{x}$    $s$
Fraternity      645    2.90         0.47
Unaffiliated    450    2.88         0.42

Test,  at  the  5%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  there  is  a  difference  in  average  GPA  between  the 
population  of  fraternity  students  and  the  population  of  unaffiliated  male 
students  on  this  university  campus. 

22.  In  comparing  the  academic  performance  of  college  students  who  are  affiliated 
with  sororities  and  those  female  students  who  are  unaffiliated,  a  random 
sample  of  students  was  drawn  from  each  of  the  two  populations  on  a  university 
campus. Summary statistics on the student GPAs are given below.

                $n$    $\bar{x}$    $s$
Sorority        330    3.18         0.37
Unaffiliated    550    3.12         0.41

Test,  at  the  5%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  there  is  a  difference  in  average  GPA  between  the 
population  of  sorority  students  and  the  population  of  unaffiliated  female 
students  on  this  university  campus. 

23.  The  owner  of  a  professional  football  team  believes  that  the  league  has  become 
more  offense  oriented  since  five  years  ago.  To  check  his  belief,  32  randomly 
selected  games  from  one  year’s  schedule  were  compared  to  32  randomly 
selected  games  from  the  schedule  five  years  later.  Since  more  offense  produces 
more  points  per  game,  the  owner  analyzed  the  following  information  on  points 
per game (ppg).

                  $n$    $\bar{x}$    $s$
ppg previously    32     20.62        4.17
ppg recently      32     22.05        4.01

Test,  at  the  10%  level  of  significance,  whether  the  data  on  points  per  game 
provide  sufficient  evidence  to  conclude  that  the  game  has  become  more 
offense  oriented. 

24.  The  owner  of  a  professional  football  team  believes  that  the  league  has  become 
more  offense  oriented  since  five  years  ago.  To  check  his  belief,  32  randomly 
selected  games  from  one  year’s  schedule  were  compared  to  32  randomly 
selected games from the schedule five years later. Since more offense produces
more offensive yards per game, the owner analyzed the following information
on offensive yards per game (oypg).

                   $n$    $\bar{x}$    $s$
oypg previously    32     316          40
oypg recently      32     336          35

Test,  at  the  10%  level  of  significance,  whether  the  data  on  offensive  yards  per 
game  provide  sufficient  evidence  to  conclude  that  the  game  has  become  more 
offense  oriented. 


LARGE  DATA  SET  EXERCISES 


25. Large Data Sets 1A and 1B list the SAT scores for 1,000 randomly selected
students. Denote the population of all male students as Population 1 and the
population of all female students as Population 2.

http://www.gone.2012books.lardbucket.org/sites/all/files/datalA.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/datalB.xls

a. Restricting attention to just the males, find $n_1$, $\bar{x}_1$, and $s_1$. Restricting
attention to just the females, find $n_2$, $\bar{x}_2$, and $s_2$.

b. Let $\mu_1$ denote the mean SAT score for all males and $\mu_2$ the mean SAT
score for all females. Use the results of part (a) to construct a 90%
confidence interval for the difference $\mu_1 - \mu_2$.

c.  Test,  at  the  5%  level  of  significance,  the  hypothesis  that  the  mean  SAT 
scores  among  males  exceeds  that  of  females. 

26. Large Data Sets 1A and 1B list the GPAs for 1,000 randomly selected students.
Denote the population of all male students as Population 1 and the population
of all female students as Population 2.

http://www.gone.2012books.lardbucket.org/sites/all/files/datalA.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/datalB.xls

a. Restricting attention to just the males, find $n_1$, $\bar{x}_1$, and $s_1$. Restricting
attention to just the females, find $n_2$, $\bar{x}_2$, and $s_2$.

b. Let $\mu_1$ denote the mean GPA for all males and $\mu_2$ the mean GPA for all
females. Use the results of part (a) to construct a 95% confidence interval
for the difference $\mu_1 - \mu_2$.

c.  Test,  at  the  10%  level  of  significance,  the  hypothesis  that  the  mean  GPAs 
among  males  and  females  differ. 
27. Large Data Sets 7A and 7B list the survival times for 65 male and 75 female
laboratory mice with thymic leukemia. Denote the population of all such male
mice as Population 1 and the population of all such female mice as Population
2.

http://www.gone.2012books.lardbucket.org/sites/all/files/data7A.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7B.xls

a. Restricting attention to just the males, find $n_1$, $\bar{x}_1$, and $s_1$. Restricting
attention to just the females, find $n_2$, $\bar{x}_2$, and $s_2$.

b. Let $\mu_1$ denote the mean survival time for all males and $\mu_2$ the mean survival
time for all females. Use the results of part (a) to construct a 99%
confidence interval for the difference $\mu_1 - \mu_2$.

c.  Test,  at  the  1%  level  of  significance,  the  hypothesis  that  the  mean  survival 
time  for  males  exceeds  that  for  females  by  more  than  182  days  (half  a 
year). 

d.  Compute  the  observed  significance  of  the  test  in  part  (c). 
ANSWERS


1.  a. $(4.20, 5.80)$,
    b. $(-18.54, -9.46)$

3.  a. $(-12.81, -10.39)$,
    b. $(-76.50, -68.10)$

5.  a. $Z = 8.753$, $\pm z_{0.025} = \pm 1.960$, reject $H_0$, p-value = 0.0000;
    b. $Z = -0.687$, $-z_{0.10} = -1.282$, do not reject $H_0$, p-value = 0.2451

7.  a. $Z = 2.444$, $\pm z_{0.005} = \pm 2.576$, do not reject $H_0$, p-value = 0.0146;
    b. $Z = 1.702$, $z_{0.05} = 1.645$, reject $H_0$, p-value = 0.0446

9.  a. $Z = -1.19$, p-value = 0.1170, do not reject $H_0$;
    b. $Z = -0.92$, p-value = 0.3576, do not reject $H_0$

11. a. $Z = 2.68$, p-value = 0.0037, reject $H_0$;
    b. $Z = -1.34$, p-value = 0.1802, do not reject $H_0$

13. a. $0.2 \pm 0.4$,
    b. $Z = 1.360$, $z_{0.01} = 2.326$, do not reject $H_0$ (not greater)
    c. p-value = 0.0869

15. a. $5.2 \pm 1.9$,
    b. $Z = -1.466$, $-z_{0.05} = -1.645$, do not reject $H_0$ (exceeds by 6.3 or
       more)
    c. p-value = 0.0708

17. a. $Z = 3.888$, $z_{0.01} = 2.326$, reject $H_0$ (upperclassmen study more)
    b. p-value = 0.0001

19. a. $5 \pm 1.8$,
    b. $Z = 4.454$, $z_{0.01} = 2.326$, reject $H_0$ (Test A is easier)
    c. p-value = 0.0000

21. $Z = 0.738$, $\pm z_{0.025} = \pm 1.960$, do not reject $H_0$ (no difference)

23. $Z = -1.398$, $-z_{0.10} = -1.282$, reject $H_0$ (more offense oriented)

25. a. $n_1 = 419$, $\bar{x}_1 = 1540.33$, $s_1 = 205.40$, $n_2 = 581$,
       $\bar{x}_2 = 1520.38$, and $s_2 = 217.34$.
    b. $(-2.24, 42.15)$
    c. $H_0 : \mu_1 - \mu_2 = 0$ vs. $H_a : \mu_1 - \mu_2 > 0$. Test Statistic: $Z = 1.48$.
       Rejection Region: $[1.645, \infty)$. Decision: Fail to reject $H_0$.

27. a. $n_1 = 65$, $\bar{x}_1 = 665.97$, $s_1 = 41.60$, $n_2 = 75$, $\bar{x}_2 = 455.89$,
       and $s_2 = 63.22$.
    b. $(187.06, 233.09)$
    c. $H_0 : \mu_1 - \mu_2 = 182$ vs. $H_a : \mu_1 - \mu_2 > 182$. Test Statistic:
       $Z = 3.14$. Rejection Region: $[2.33, \infty)$. Decision: Reject $H_0$.
    d. p-value = 0.0008
9.2 Comparison of Two Population Means: Small, Independent Samples


LEARNING  OBJECTIVES 

1.  To  learn  how  to  construct  a  confidence  interval  for  the  difference  in  the 
means  of  two  distinct  populations  using  small,  independent  samples. 

2.  To  learn  how  to  perform  a  test  of  hypotheses  concerning  the  difference 
between  the  means  of  two  distinct  populations  using  small,  independent 
samples. 


When  one  or  the  other  of  the  sample  sizes  is  small,  as  is  often  the  case  in  practice, 
the  Central  Limit  Theorem  does  not  apply.  We  must  then  impose  conditions  on  the 
population  to  give  statistical  validity  to  the  test  procedure.  We  will  assume  that 
both  populations  from  which  the  samples  are  taken  have  a  normal  probability 
distribution  and  that  their  standard  deviations  are  equal. 

Confidence  Intervals 

When the two populations are normally distributed and have equal standard
deviations, the following formula for a confidence interval for $\mu_1 - \mu_2$ is valid.


$100(1 - \alpha)\%$ Confidence Interval for the Difference
Between Two Population Means: Small, Independent
Samples

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \quad \text{where} \quad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

The number of degrees of freedom is $df = n_1 + n_2 - 2$.

The samples must be independent, the populations must be normal, and the
population standard deviations must be equal. “Small” samples means that
either $n_1 < 30$ or $n_2 < 30$.

The quantity $s_p^2$ is called the pooled sample variance. It is a weighted average of
the two estimates $s_1^2$ and $s_2^2$ of the common variance $\sigma_1^2 = \sigma_2^2$ of the two
populations.
EXAMPLE 4


A  software  company  markets  a  new  computer  game  with  two  experimental 
packaging  designs.  Design  1  is  sent  to  11  stores;  their  average  sales  the  first 
month  is  52  units  with  sample  standard  deviation  12  units.  Design  2  is  sent 
to  6  stores;  their  average  sales  the  first  month  is  46  units  with  sample 
standard  deviation  10  units.  Construct  a  point  estimate  and  a  95% 
confidence  interval  for  the  difference  in  average  monthly  sales  between  the 
two  package  designs. 

Solution: 


The point estimate of $\mu_1 - \mu_2$ is

$$\bar{x}_1 - \bar{x}_2 = 52 - 46 = 6$$

In words, we estimate that the average monthly sales for Design 1 is 6 units
more per month than the average monthly sales for Design 2.


To apply the formula for the confidence interval, we must find $t_{\alpha/2}$. The
95% confidence level means that $\alpha = 1 - 0.95 = 0.05$ so that $t_{\alpha/2} = t_{0.025}$.
From Figure 12.3 "Critical Values of $t$", in the row with the heading
$df = 11 + 6 - 2 = 15$ we read that $t_{0.025} = 2.131$. From the formula for the pooled
sample variance we compute

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(10)(12)^2 + (5)(10)^2}{15} = 129.\overline{3}$$

Thus

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = 6 \pm (2.131) \sqrt{129.\overline{3} \left( \frac{1}{11} + \frac{1}{6} \right)} \approx 6 \pm 12.3$$

We are 95% confident that the difference in the population means lies in the
interval $[-6.3, 18.3]$, in the sense that in repeated sampling 95% of all
intervals constructed from the sample data in this manner will contain
$\mu_1 - \mu_2$. Because the interval contains both positive and negative values
the statement in the context of the problem is that we are 95% confident
that the average monthly sales for Design 1 is between 18.3 units higher and
6.3 units lower than the average monthly sales for Design 2.
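
The arithmetic of Example 4 can be retraced with a short Python sketch. The function names are hypothetical, and the critical value 2.131 is entered by hand as the table value quoted in the example, since the standard library has no Student's t-distribution.

```python
from math import sqrt


def pooled_variance(s1, n1, s2, n2):
    """Pooled sample variance: weighted average of the two sample variances."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)


def small_sample_ci(x1, s1, n1, x2, s2, n2, t_crit):
    """Confidence interval for mu_1 - mu_2 using the pooled variance."""
    sp2 = pooled_variance(s1, n1, s2, n2)
    margin = t_crit * sqrt(sp2 * (1 / n1 + 1 / n2))
    point_estimate = x1 - x2
    return point_estimate - margin, point_estimate + margin


# t_{0.025} for df = 11 + 6 - 2 = 15, as read from the table in Example 4.
T_CRIT_DF15 = 2.131

sp2 = pooled_variance(12, 11, 10, 6)
low, high = small_sample_ci(52, 12, 11, 46, 10, 6, T_CRIT_DF15)
print(f"pooled variance = {sp2:.1f}, interval = [{low:.1f}, {high:.1f}]")
```

Keeping the pooled variance in its own function mirrors the two-stage calculation in the example and lets the same helper serve the test statistic of the next subsection.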


Hypothesis  Testing 

Testing  hypotheses  concerning  the  difference  of  two  population  means  using  small 
samples  is  done  precisely  as  it  is  done  for  large  samples,  using  the  following 
standardized  test  statistic.  The  same  conditions  on  the  populations  that  were 
required  for  constructing  a  confidence  interval  for  the  difference  of  the  means  must 
also  be  met  when  hypotheses  are  tested. 


Standardized Test Statistic for Hypothesis Tests
Concerning the Difference Between Two Population
Means: Small, Independent Samples

$$T = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} \quad \text{where} \quad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

The test statistic has Student’s t-distribution with $df = n_1 + n_2 - 2$ degrees of
freedom.

The samples must be independent, the populations must be normal, and the
population standard deviations must be equal. “Small” samples means that
either $n_1 < 30$ or $n_2 < 30$.
EXAMPLE 5


Refer  to  Note  9.11  "Example  4"  concerning  the  mean  sales  per  month  for  the 
same  computer  game  but  sold  with  two  package  designs.  Test  at  the  1%  level 
of  significance  whether  the  data  provide  sufficient  evidence  to  conclude  that 
the  mean  sales  per  month  of  the  two  designs  are  different.  Use  the  critical 
value  approach. 

Solution: 

• Step 1. The relevant test is

$H_0 : \mu_1 - \mu_2 = 0$

vs. $H_a : \mu_1 - \mu_2 \neq 0$ @ $\alpha = 0.01$

• Step 2. Since the samples are independent and at least one is less
than 30 the test statistic is

$$T = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}$$

which has Student’s t-distribution with
$df = 11 + 6 - 2 = 15$ degrees of freedom.

• Step 3. Inserting the data and the value $D_0 = 0$ into the
formula for the test statistic gives

$$T = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} = \frac{(52 - 46) - 0}{\sqrt{129.\overline{3} \left( \dfrac{1}{11} + \dfrac{1}{6} \right)}} = 1.040$$

• Step 4. Since the symbol in $H_a$ is “$\neq$” this is a two-tailed test, so
there are two critical values, $\pm t_{\alpha/2} = \pm t_{0.005}$. From the row
in Figure 12.3 "Critical Values of $t$" with the heading $df = 15$ we
read off $t_{0.005} = 2.947$. The rejection region is
$(-\infty, -2.947] \cup [2.947, \infty)$.


Figure  9.4 

Rejection  Region  and  Test  Statistic  for  Note  9.13  "Example  5" 


• Step 5. As shown in Figure 9.4 "Rejection Region and Test
Statistic for Note 9.13 'Example 5'" the test statistic does not fall in the
rejection region. The decision is not to reject $H_0$. In the context of the
problem our conclusion is:

The  data  do  not  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  the  mean  sales  per  month  of  the 
two  designs  are  different. 
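
A brief numerical check of Example 5, offered as a sketch: the function name `pooled_t_statistic` is hypothetical, and the critical value 2.947 is the table value quoted in Step 4 rather than something the code computes.

```python
from math import sqrt


def pooled_t_statistic(x1, s1, n1, x2, s2, n2, d0=0.0):
    """Standardized test statistic T for small, independent samples."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return ((x1 - x2) - d0) / sqrt(sp2 * (1 / n1 + 1 / n2))


t_stat = pooled_t_statistic(52, 12, 11, 46, 10, 6)

# t_{0.005} with df = 15, taken from the table as in Step 4.
T_CRIT = 2.947

# Two-tailed test: reject when T lies in either tail of the rejection region.
reject_h0 = abs(t_stat) >= T_CRIT
print(f"T = {t_stat:.3f}, reject H0: {reject_h0}")
```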
EXAMPLE 6


Perform  the  test  of  Note  9.13  "Example  5"  using  the  p-value  approach. 
Solution: 

The  first  three  steps  are  identical  to  those  in  Note  9.13  "Example  5". 


• Step 4. Because the test is two-tailed the observed significance or
p-value of the test is the double of the area of the right tail of
Student’s t-distribution, with 15 degrees of freedom, that is cut
off by the test statistic T = 1.040. We can only approximate this
number. Looking in the row of Figure 12.3 "Critical Values of $t$"
headed $df = 15$, the number 1.040 is between the numbers
0.866 and 1.341, corresponding to $t_{0.200}$ and $t_{0.100}$.

The area cut off by t = 0.866 is 0.200 and the area cut off by t =
1.341 is 0.100. Since 1.040 is between 0.866 and 1.341 the area it
cuts off is between 0.200 and 0.100. Thus the p-value (since the
area must be doubled) is between 0.400 and 0.200.

• Step 5. Since $p > 0.200 > 0.01$, $p > \alpha$, so the decision is
not to reject the null hypothesis:

The  data  do  not  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  the  mean  sales  per  month  of  the 
two  designs  are  different. 
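
The table-bracketing argument of Step 4 can be written out as a small Python sketch. The names `DF15_ROW` and `bracket_two_tailed_p` are hypothetical, and the row below holds only the four df = 15 entries quoted in Examples 4 through 6; a full table row has more columns, so brackets for other values of T may be wider here than the table would actually allow.

```python
# (right-tail area, critical value) pairs for df = 15, limited to the
# entries quoted in the examples of this section.
DF15_ROW = [(0.200, 0.866), (0.100, 1.341), (0.025, 2.131), (0.005, 2.947)]


def bracket_two_tailed_p(t_stat, row):
    """Bound the two-tailed p-value using tabulated critical values.

    Returns (lower, upper); None marks a bound the row cannot supply.
    """
    t = abs(t_stat)
    # Tabulated values at or below t cut off at least as much area as t does.
    upper_tail = min((area for area, crit in row if crit <= t), default=None)
    # Tabulated values at or above t cut off at most as much area as t does.
    lower_tail = max((area for area, crit in row if crit >= t), default=None)
    lower = None if lower_tail is None else 2 * lower_tail
    upper = None if upper_tail is None else 2 * upper_tail
    return lower, upper


low, high = bracket_two_tailed_p(1.040, DF15_ROW)
alpha = 0.01
print(f"p-value is between {low:.3f} and {high:.3f}")
print("do not reject H0" if low > alpha else "bracket alone is inconclusive")
```

Only the lower end of the bracket is needed for the decision: once it exceeds $\alpha$, the p-value must as well.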


KEY  TAKEAWAYS 


•  In  the  context  of  estimating  or  testing  hypotheses  concerning  two 
population  means,  “small”  samples  means  that  at  least  one  sample  is 
small.  In  particular,  even  if  one  sample  is  of  size  30  or  more,  if  the  other 
is  of  size  less  than  30  the  formulas  of  this  section  must  be  used. 

•  A  confidence  interval  for  the  difference  in  two  population  means  is 
computed  using  a  formula  in  the  same  fashion  as  was  done  for  a  single 
population mean.

EXERCISES

In all exercises for this section assume that the populations are normal and
have  equal  standard  deviations. 

1. Construct the confidence interval for $\mu_1 - \mu_2$ for the level of confidence and
the data from independent samples given.

a. 95% confidence,

$n_1 = 10$, $\bar{x}_1 = 120$, $s_1 = 2$
$n_2 = 15$, $\bar{x}_2 = 101$, $s_2 = 4$

b. 99% confidence,

$n_1 = 6$, $\bar{x}_1 = 25$, $s_1 = 1$
$n_2 = 12$, $\bar{x}_2 = 17$, $s_2 = 3$

2. Construct the confidence interval for $\mu_1 - \mu_2$ for the level of confidence and
the data from independent samples given.

a. 90% confidence,

$n_1 = 28$, $\bar{x}_1 = 212$, $s_1 = 6$
$n_2 = 23$, $\bar{x}_2 = 198$, $s_2 = 5$

b. 99% confidence,

$n_1 = 14$, $\bar{x}_1 = 68$, $s_1 = 8$
$n_2 = 20$, $\bar{x}_2 = 43$, $s_2 = 3$

3. Construct the confidence interval for $\mu_1 - \mu_2$ for the level of confidence and
the data from independent samples given.

a. 99.9% confidence,

$n_1 = 35$, $\bar{x}_1 = 6.5$, $s_1 = 0.2$
$n_2 = 20$, $\bar{x}_2 = 6.2$, $s_2 = 0.1$

b. 99% confidence,

$n_1 = 18$, $\bar{x}_1 = 77.3$, $s_1 = 1.2$
$n_2 = 32$, $\bar{x}_2 = 75.0$, $s_2 = 1.6$

4. Construct the confidence interval for $\mu_1 - \mu_2$ for the level of confidence and
the data from independent samples given.

a. 99.5% confidence,

$n_1 = 40$, $\bar{x}_1 = 85.6$, $s_1 = 2.8$
$n_2 = 20$, $\bar{x}_2 = 73.1$, $s_2 = 2.1$

b. 99.9% confidence,

$n_1 = 25$, $\bar{x}_1 = 215$, $s_1 = 7$
$n_2 = 35$, $\bar{x}_2 = 185$, $s_2 = 12$

5. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the critical value approach.

a. Test $H_0: \mu_1 - \mu_2 = 11$ vs. $H_a: \mu_1 - \mu_2 > 11$ @ $\alpha = 0.025$,

$n_1 = 6$, $\bar{x}_1 = 32$, $s_1 = 2$
$n_2 = 11$, $\bar{x}_2 = 19$, $s_2 = 1$

b. Test $H_0: \mu_1 - \mu_2 = 26$ vs. $H_a: \mu_1 - \mu_2 \neq 26$ @ $\alpha = 0.05$,

$n_1 = 17$, $\bar{x}_1 = 166$, $s_1 = 4$
$n_2 = 24$, $\bar{x}_2 = 138$, $s_2 = 3$

6. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the critical value approach.

a. Test $H_0: \mu_1 - \mu_2 = 40$ vs. $H_a: \mu_1 - \mu_2 < 40$ @ $\alpha = 0.10$,

$n_1 = 14$, $\bar{x}_1 = 289$, $s_1 = 11$
$n_2 = 12$, $\bar{x}_2 = 254$, $s_2 = 9$

b. Test $H_0: \mu_1 - \mu_2 = 21$ vs. $H_a: \mu_1 - \mu_2 \neq 21$ @ $\alpha = 0.05$,

$n_1 = 23$, $\bar{x}_1 = 130$, $s_1 = 6$
$n_2 = 27$, $\bar{x}_2 = 113$, $s_2 = 8$

7. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the critical value approach.

a. Test $H_0: \mu_1 - \mu_2 = -15$ vs. $H_a: \mu_1 - \mu_2 < -15$ @ $\alpha = 0.10$,

$n_1 = 30$, $\bar{x}_1 = 42$, $s_1 = 7$
$n_2 = 12$, $\bar{x}_2 = 60$, $s_2 = 5$

b. Test $H_0: \mu_1 - \mu_2 = 103$ vs. $H_a: \mu_1 - \mu_2 \neq 103$ @ $\alpha = 0.10$,

$n_1 = 17$, $\bar{x}_1 = 711$, $s_1 = 28$
$n_2 = 32$, $\bar{x}_2 = 598$, $s_2 = 21$

8. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the critical value approach.

a. Test $H_0: \mu_1 - \mu_2 = 75$ vs. $H_a: \mu_1 - \mu_2 > 75$ @ $\alpha = 0.025$,

$n_1 = 45$, $\bar{x}_1 = 674$, $s_1 = 18$
$n_2 = 29$, $\bar{x}_2 = 591$, $s_2 = 13$

b. Test $H_0: \mu_1 - \mu_2 = -20$ vs. $H_a: \mu_1 - \mu_2 \neq -20$ @ $\alpha = 0.005$,

$n_1 = 30$, $\bar{x}_1 = 137$, $s_1 = 8$
$n_2 = 19$, $\bar{x}_2 = 166$, $s_2 = 11$

9. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the p-value approach. (The p-value can be only
approximated.)

a. Test $H_0: \mu_1 - \mu_2 = 12$ vs. $H_a: \mu_1 - \mu_2 > 12$ @ $\alpha = 0.01$,

$n_1 = 20$, $\bar{x}_1 = 133$, $s_1 = 7$
$n_2 = 10$, $\bar{x}_2 = 115$, $s_2 = 5$

b. Test $H_0: \mu_1 - \mu_2 = 46$ vs. $H_a: \mu_1 - \mu_2 \neq 46$ @ $\alpha = 0.10$,

$n_1 = 24$, $\bar{x}_1 = 586$, $s_1 = 11$
$n_2 = 27$, $\bar{x}_2 = 535$, $s_2 = 13$

10. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the p-value approach. (The p-value can be only
approximated.)

a. Test $H_0: \mu_1 - \mu_2 = 38$ vs. $H_a: \mu_1 - \mu_2 < 38$ @ $\alpha = 0.01$,

$n_1 = 12$, $\bar{x}_1 = 464$, $s_1 = 5$
$n_2 = 10$, $\bar{x}_2 = 432$, $s_2 = 6$

b. Test $H_0: \mu_1 - \mu_2 = 4$ vs. $H_a: \mu_1 - \mu_2 \neq 4$ @ $\alpha = 0.005$,

$n_1 = 14$, $\bar{x}_1 = 68$, $s_1 = 2$
$n_2 = 11$, $\bar{x}_2 = 67$, $s_2 = 3$

11. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the p-value approach. (The p-value can be only
approximated.)

a. Test $H_0: \mu_1 - \mu_2 = 50$ vs. $H_a: \mu_1 - \mu_2 > 50$ @ $\alpha = 0.01$,

$n_1 = 30$, $\bar{x}_1 = 681$, $s_1 = 8$
$n_2 = 27$, $\bar{x}_2 = 625$, $s_2 = 8$

b. Test $H_0: \mu_1 - \mu_2 = 35$ vs. $H_a: \mu_1 - \mu_2 \neq 35$ @ $\alpha = 0.10$,

$n_1 = 36$, $\bar{x}_1 = 325$, $s_1 = 11$
$n_2 = 29$, $\bar{x}_2 = 286$, $s_2 = 7$

12. Perform the test of hypotheses indicated, using the data from independent
samples given. Use the p-value approach. (The p-value can be only
approximated.)

a. Test $H_0: \mu_1 - \mu_2 = -4$ vs. $H_a: \mu_1 - \mu_2 < -4$ @ $\alpha = 0.05$,

$n_1 = 40$, $\bar{x}_1 = 80$, $s_1 = 5$
$n_2 = 25$, $\bar{x}_2 = 87$, $s_2 = 5$

b. Test $H_0: \mu_1 - \mu_2 = 21$ vs. $H_a: \mu_1 - \mu_2 \neq 21$ @ $\alpha = 0.01$,

$n_1 = 15$, $\bar{x}_1 = 192$, $s_1 = 12$
$n_2 = 34$, $\bar{x}_2 = 180$, $s_2 = 8$

13. A county environmental agency suspects that the fish in a particular polluted
lake have elevated mercury levels. To confirm that suspicion, five striped bass
in that lake were caught and their tissues were tested for mercury. For the
purpose of comparison, four striped bass in an unpolluted lake were also
caught and tested. The fish tissue mercury levels in mg/kg are given below.

Sample 1               Sample 2
(from polluted lake)   (from unpolluted lake)
0.580                  0.382
0.711                  0.276
0.571                  0.570
0.666                  0.366
0.598


a.  Construct  the  95%  confidence  interval  for  the  difference  in  the  population 
means  based  on  these  data. 

b.  Test,  at  the  5%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  fish  in  the  polluted  lake  have  elevated  levels  of 
mercury  in  their  tissue. 

14.  A  genetic  engineering  company  claims  that  it  has  developed  a  genetically 
modified  tomato  plant  that  yields  on  average  more  tomatoes  than  other 
varieties.  A  farmer  wants  to  test  the  claim  on  a  small  scale  before  committing 
to  a  full-scale  planting.  Ten  genetically  modified  tomato  plants  are  grown  from 
seeds  along  with  ten  other  tomato  plants.  At  the  season’s  end,  the  resulting 
yields in pounds are recorded below.

Sample 1                 Sample 2
(genetically modified)   (regular)
20                       21
23                       21
27                       22
25                       18
25                       20
25                       20
27                       18
23                       25
24                       23
22                       20

a.  Construct  the  99%  confidence  interval  for  the  difference  in  the  population 
means  based  on  these  data. 

b.  Test,  at  the  1%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  the  mean  yield  of  the  genetically  modified 
variety  is  greater  than  that  for  the  standard  variety. 

15.  The  coaching  staff  of  a  professional  football  team  believes  that  the  rushing 
offense  has  become  increasingly  potent  in  recent  years.  To  investigate  this 
belief,  20  randomly  selected  games  from  one  year’s  schedule  were  compared  to 
11  randomly  selected  games  from  the  schedule  five  years  later.  The  sample 
information  on  rushing  yards  per  game  (rypg)  is  summarized  below. 


                   n     $\bar{x}$    s
rypg previously    20    112          24
rypg recently      11    114          21

a.  Construct  the  95%  confidence  interval  for  the  difference  in  the  population 
means  based  on  these  data. 

b.  Test,  at  the  5%  level  of  significance,  whether  the  data  on  rushing  yards  per 
game  provide  sufficient  evidence  to  conclude  that  the  rushing  offense  has 
become  more  potent  in  recent  years. 

16. The coaching staff of a professional football team believes that the passing
offense has become increasingly potent in recent years. To investigate this
belief, 20 randomly selected games from one year’s schedule were compared to
11 randomly selected games from the schedule five years later. The sample
information on passing yards per game (pypg) is summarized below.

                   n     $\bar{x}$    s
pypg previously    20    203          38
pypg recently      11    232          33

a.  Construct  the  95%  confidence  interval  for  the  difference  in  the  population 
means  based  on  these  data. 

b.  Test,  at  the  5%  level  of  significance,  whether  the  data  on  passing  yards  per 
game  provide  sufficient  evidence  to  conclude  that  the  passing  offense  has 
become  more  potent  in  recent  years. 

17.  A  university  administrator  wishes  to  know  if  there  is  a  difference  in  average 
starting  salary  for  graduates  with  master’s  degrees  in  engineering  and  those 
with  master’s  degrees  in  business.  Fifteen  recent  graduates  with  master’s 
degrees in engineering and 11 with master’s degrees in business are surveyed
and the results are summarized below.

               n     $\bar{x}$    s
Engineering    15    68,535       1627
Business       11    63,230       2033

a.  Construct  the  90%  confidence  interval  for  the  difference  in  the  population 
means  based  on  these  data. 

b.  Test,  at  the  10%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  the  average  starting  salaries  are  different. 

18.  A  gardener  sets  up  a  flower  stand  in  a  busy  business  district  and  sells  bouquets 
of  assorted  fresh  flowers  on  weekdays.  To  find  a  more  profitable  pricing,  she 
sells  bouquets  for  15  dollars  each  for  ten  days,  then  for  10  dollars  each  for  five 
days. Her average daily profits for the two different prices are given below.

       n     $\bar{x}$    s
$15    10    171          26
$10    5     198          29

a.  Construct  the  90%  confidence  interval  for  the  difference  in  the  population 
means  based  on  these  data. 

b.  Test,  at  the  10%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  the  gardener’s  average  daily  profit  will  be  higher  if 
the bouquets are sold at $10 each.

ANSWERS

1. a. (16.16, 21.84),
   b. (4.28, 11.72)

3. a. (0.13, 0.47),
   b. (1.14, 3.46)

5. a. $T = 2.787$, $t_{0.025} = 2.131$, reject $H_0$,
   b. $T = 1.831$, $\pm t_{0.025} = \pm 2.023$, do not reject $H_0$

7. a. $T = -1.349$, $-t_{0.10} = -1.303$, reject $H_0$,
   b. $T = 1.411$, $\pm t_{0.05} = \pm 1.678$, do not reject $H_0$

9. a. $T = 2.411$, df = 28, p-value > 0.01, do not reject $H_0$,
   b. $T = 1.473$, df = 49, p-value > 0.10, do not reject $H_0$

11. a. $T = 2.827$, df = 55, p-value < 0.01, reject $H_0$,
    b. $T = 1.699$, df = 63, p-value < 0.10, reject $H_0$

13. a. $0.2267 \pm 0.2182$,
    b. $T = 3.635$, df = 7, $t_{0.05} = 1.895$, reject $H_0$ (elevated levels)

15. a. $-2 \pm 17.7$,
    b. $T = -0.232$, df = 29, $-t_{0.05} = -1.699$, do not reject $H_0$ (not
    more potent)

17. a. $5305 \pm 1227$,
    b. $T = 7.395$, df = 24, $\pm t_{0.05} = \pm 1.711$, reject $H_0$ (different)

9.3 Comparison of Two Population Means: Paired Samples


LEARNING  OBJECTIVES 

1.  To  learn  the  distinction  between  independent  samples  and  paired 
samples. 

2.  To  learn  how  to  construct  a  confidence  interval  for  the  difference  in  the 
means  of  two  distinct  populations  using  paired  samples. 

3.  To  learn  how  to  perform  a  test  of  hypotheses  concerning  the  difference 
in  the  means  of  two  distinct  populations  using  paired  samples. 


Suppose  chemical  engineers  wish  to  compare  the  fuel  economy  obtained  by  two 
different  formulations  of  gasoline.  Since  fuel  economy  varies  widely  from  car  to  car, 
if  the  mean  fuel  economy  of  two  independent  samples  of  vehicles  run  on  the  two 
types  of  fuel  were  compared,  even  if  one  formulation  were  better  than  the  other  the 
large variability from vehicle to vehicle might make any difference arising from
the difference in fuel difficult to detect. Just imagine one random sample having many
more  large  vehicles  than  the  other.  Instead  of  independent  random  samples,  it 
would  make  more  sense  to  select  pairs  of  cars  of  the  same  make  and  model  and 
driven  under  similar  circumstances,  and  compare  the  fuel  economy  of  the  two  cars 
in  each  pair.  Thus  the  data  would  look  something  like  Table  9.1  "Fuel  Economy  of 
Pairs  of  Vehicles",  where  the  first  car  in  each  pair  is  operated  on  one  formulation  of 
the  fuel  (call  it  Type  1  gasoline)  and  the  second  car  is  operated  on  the  second  (call  it 
Type  2  gasoline). 


Table 9.1 Fuel Economy of Pairs of Vehicles

Make and Model    Car 1    Car 2
Buick LaCrosse    17.0     17.0
Dodge Viper       13.2     12.9
Honda CR-Z        35.3     35.4
Hummer H3         13.6     13.2
Lexus RX          32.7     32.5
Mazda CX-9        18.4     18.1
Saab 9-3          22.5     22.5
Toyota Corolla    26.8     26.7
Volvo XC90        15.1     15.0

The first column of numbers forms a sample from Population 1, the population of all
cars operated on Type 1 gasoline; the second column of numbers forms a sample
from Population 2, the population of all cars operated on Type 2 gasoline. It would
be incorrect to analyze the data using the formulas from the previous section,
however, since the samples were not drawn independently. What is correct is to
compute the difference in the numbers in each pair (subtracting in the same order
each time) to obtain the third column of numbers as shown in Table 9.2 "Fuel
Economy of Pairs of Vehicles" and treat the differences as the data. At this point,
the new sample of differences $d_1 = 0.0, \ldots, d_9 = 0.1$ in the third column of Table
9.2 "Fuel Economy of Pairs of Vehicles" may be considered as a random sample of
size n = 9 selected from a population with mean $\mu_d = \mu_1 - \mu_2$. This approach
essentially transforms the paired two-sample problem into a one-sample problem
as discussed in the previous two chapters.


Table 9.2 Fuel Economy of Pairs of Vehicles

Make and Model    Car 1    Car 2    Difference
Buick LaCrosse    17.0     17.0      0.0
Dodge Viper       13.2     12.9      0.3
Honda CR-Z        35.3     35.4     -0.1
Hummer H3         13.6     13.2      0.4
Lexus RX          32.7     32.5      0.2
Mazda CX-9        18.4     18.1      0.3
Saab 9-3          22.5     22.5      0.0
Toyota Corolla    26.8     26.7      0.1
Volvo XC90        15.1     15.0      0.1

Note carefully that although it does not matter in what order the subtraction is done,
it must be done in the same order for all pairs. This is why there are both positive
and negative quantities in the third column of numbers in Table 9.2 "Fuel Economy
of Pairs of Vehicles".

Confidence Intervals

When the population of differences is normally distributed the following formula
for a confidence interval for $\mu_d = \mu_1 - \mu_2$ is valid.

$100(1 - \alpha)\%$ Confidence Interval for the Difference
Between Two Population Means: Paired Difference
Samples

$$\bar{d} \pm t_{\alpha/2} \frac{s_d}{\sqrt{n}}$$

where there are n pairs, $\bar{d}$ is the mean and $s_d$ is the standard deviation of their
differences.

The number of degrees of freedom is $df = n - 1$.

The population of differences must be normally distributed.

EXAMPLE 7


Using  the  data  in  Table  9.1  "Fuel  Economy  of  Pairs  of  Vehicles"  construct  a 
point  estimate  and  a  95%  confidence  interval  for  the  difference  in  average 
fuel  economy  between  cars  operated  on  Type  1  gasoline  and  cars  operated 
on  Type  2  gasoline. 

Solution: 

We  have  referred  to  the  data  in  Table  9.1  "Fuel  Economy  of  Pairs  of 
Vehicles"  because  that  is  the  way  that  the  data  are  typically  presented,  but 
we  emphasize  that  with  paired  sampling  one  immediately  computes  the 
differences,  as  given  in  Table  9.2  "Fuel  Economy  of  Pairs  of  Vehicles",  and 
uses  the  differences  as  the  data. 

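The mean and standard deviation of the differences are

$$\bar{d} = \frac{\Sigma d}{n} = \frac{1.3}{9} = 0.1\bar{4} \quad \text{and} \quad s_d = \sqrt{\frac{\Sigma d^2 - \frac{1}{n}(\Sigma d)^2}{n - 1}} = \sqrt{\frac{0.41 - \frac{1}{9}(1.3)^2}{8}} = 0.1\bar{6}$$

The point estimate of $\mu_1 - \mu_2 = \mu_d$ is

$$\bar{d} = 0.14$$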

In  words,  we  estimate  that  the  average  fuel  economy  of  cars  using  Type  1 
gasoline  is  0.14  mpg  greater  than  the  average  fuel  economy  of  cars  using 
Type  2  gasoline. 


To apply the formula for the confidence interval, we must find $t_{\alpha/2}$. The
95% confidence level means that $\alpha = 1 - 0.95 = 0.05$ so that
$t_{\alpha/2} = t_{0.025}$. From Figure 12.3 "Critical Values of", in the row with the
heading $df = 9 - 1 = 8$ we read that $t_{0.025} = 2.306$. Thus

$$\bar{d} \pm t_{\alpha/2} \frac{s_d}{\sqrt{n}} = 0.14 \pm 2.306 \left( \frac{0.1\bar{6}}{\sqrt{9}} \right) = 0.14 \pm 0.13$$

We are 95% confident that the difference in the population means lies in the
interval [0.01, 0.27], in the sense that in repeated sampling 95% of all
intervals constructed from the sample data in this manner will contain
$\mu_d = \mu_1 - \mu_2$. Stated differently, we are 95% confident that mean fuel
economy is between 0.01 and 0.27 mpg greater with Type 1 gasoline than
with Type 2 gasoline.
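
The arithmetic of this example can be sketched in a few lines of Python. This is an illustrative check under stated assumptions, not the book's own procedure: the variable names are hypothetical, the critical value 2.306 is simply copied from the table lookup in the text, and only the standard library is used.

```python
import math
import statistics

# Fuel economy (mpg) from Table 9.1, with both lists in the same row order.
car1 = [17.0, 13.2, 35.3, 13.6, 32.7, 18.4, 22.5, 26.8, 15.1]
car2 = [17.0, 12.9, 35.4, 13.2, 32.5, 18.1, 22.5, 26.7, 15.0]

# Subtract in the same order (Car 1 minus Car 2) for every pair.
diffs = [a - b for a, b in zip(car1, car2)]

n = len(diffs)
d_bar = statistics.mean(diffs)    # mean of the differences
s_d = statistics.stdev(diffs)     # sample standard deviation (divides by n - 1)

# Critical value t_{0.025} with df = n - 1 = 8, copied from the table lookup.
t_crit = 2.306

margin = t_crit * s_d / math.sqrt(n)
lower, upper = d_bar - margin, d_bar + margin

print(f"d_bar = {d_bar:.4f}, s_d = {s_d:.4f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Carrying full precision through the calculation gives endpoints of about 0.016 and 0.273. The interval [0.01, 0.27] in the text comes from rounding the mean and the margin of error to 0.14 and 0.13 before combining them.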


Hypothesis  Testing 

Testing hypotheses concerning the difference of two population means using paired
difference samples is done precisely as it is done for independent samples, although
now the null and alternative hypotheses are expressed in terms of $\mu_d$ instead of
$\mu_1 - \mu_2$. Thus the null hypothesis will always be written

$$H_0: \mu_d = D_0$$

The  three  forms  of  the  alternative  hypothesis,  with  the  terminology  for  each  case, 
are: 


Form of $H_a$             Terminology
$H_a: \mu_d < D_0$        Left-tailed
$H_a: \mu_d > D_0$        Right-tailed
$H_a: \mu_d \neq D_0$     Two-tailed

The same conditions on the population of differences that were required for
constructing a confidence interval for the difference of the means must also be met
when hypotheses are tested. Here is the standardized test statistic that is used in
the test.

Standardized Test Statistic for Hypothesis Tests
Concerning the Difference Between Two Population
Means: Paired Difference Samples

$$T = \frac{\bar{d} - D_0}{s_d / \sqrt{n}}$$

where there are n pairs, $\bar{d}$ is the mean and $s_d$ is the standard deviation of their
differences.

The test statistic has Student’s t-distribution with $df = n - 1$ degrees of
freedom.

The population of differences must be normally distributed.

EXAMPLE 8


Using  the  data  of  Table  9.2  "Fuel  Economy  of  Pairs  of  Vehicles"  test  the 
hypothesis  that  mean  fuel  economy  for  Type  1  gasoline  is  greater  than  that 
for  Type  2  gasoline  against  the  null  hypothesis  that  the  two  formulations  of 
gasoline  yield  the  same  mean  fuel  economy.  Test  at  the  5%  level  of 
significance  using  the  critical  value  approach. 

Solution: 

The  only  part  of  the  table  that  we  use  is  the  third  column,  the  differences. 

• Step 1. Since the differences were computed in the order
Type 1 mpg - Type 2 mpg, better fuel economy with
Type 1 fuel corresponds to $\mu_d = \mu_1 - \mu_2 > 0$. Thus the test
is

$H_0: \mu_d = 0$
vs. $H_a: \mu_d > 0$ @ $\alpha = 0.05$

(If the differences had been computed in the opposite order then
the alternative hypothesis would have been $H_a: \mu_d < 0$.)

• Step 2. Since the sampling is in pairs the test statistic is

$$T = \frac{\bar{d} - D_0}{s_d / \sqrt{n}}$$

• Step 3. We have already computed $\bar{d}$ and $s_d$ in the previous
example. Inserting their values and $D_0 = 0$ into the formula
for the test statistic gives

$$T = \frac{\bar{d} - D_0}{s_d / \sqrt{n}} = \frac{0.1\bar{4}}{0.1\bar{6} / 3} = 2.600$$


• Step 4. Since the symbol in $H_a$ is “>” this is a right-tailed test, so there is
a single critical value, $t_\alpha = t_{0.05}$ with 8 degrees of freedom, which from
the row labeled df = 8 in Figure 12.3 "Critical Values of" we read off
as 1.860. The rejection region is $[1.860, \infty)$.

• Step 5. As shown in Figure 9.5 "Rejection Region and Test
Statistic for " the test statistic falls in the rejection region. The
decision is to reject $H_0$. In the context of the problem our
conclusion is:

The data provide sufficient evidence, at the 5% level of
significance, to conclude that the mean fuel economy provided
by Type 1 gasoline is greater than that for Type 2 gasoline.

Figure 9.5 Rejection Region and Test Statistic for Note 9.20 "Example 8"
[Figure: right-tailed rejection region for $H_a: \mu_d > 0$, with the test statistic $T = 2.600$ marked.]

EXAMPLE 9

Perform the test of Note 9.20 "Example 8" using the p-value approach.
Solution: 

The  first  three  steps  are  identical  to  those  in  Note  9.20  "Example  8". 

•  Step  4.  Because  the  test  is  one-tailed  the  observed  significance  or 
p-value  of  the  test  is  just  the  area  of  the  right  tail  of  Student’s  t- 
distribution,  with  8  degrees  of  freedom,  that  is  cut  off  by  the  test 
statistic  T  =  2.600.  We  can  only  approximate  this  number. 
Looking in the row of Figure 12.3 "Critical Values of" headed
df = 8, the number 2.600 is between the numbers 2.306 and
2.896, corresponding to $t_{0.025}$ and $t_{0.010}$.

The  area  cut  off  by  t  =  2.306  is  0.025  and  the  area  cut  off  by  t  = 
2.896  is  0.010.  Since  2.600  is  between  2.306  and  2.896  the  area  it 
cuts  off  is  between  0.025  and  0.010.  Thus  the  p-value  is  between 
0.025  and  0.010.  In  particular  it  is  less  than  0.025.  See  Figure  9.6. 

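Figure 9.6 P-Value for Note 9.21 "Example 9"
[Figure: right tail of the t-distribution; the p-value is the area to the right of $T = 2.6$, which lies between $t = 2.306$ and $t = 2.896$ (area = 0.010).]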

• Step 5. Since $0.025 < 0.05$, $p < \alpha$ so the decision is to reject the
null hypothesis:

The data provide sufficient evidence, at the 5% level of
significance, to conclude that the mean fuel economy provided
by Type 1 gasoline is greater than that for Type 2 gasoline.


The  paired  two-sample  experiment  is  a  very  powerful  study  design.  It  bypasses 
many  unwanted  sources  of  “statistical  noise”  that  might  otherwise  influence  the 
outcome  of  the  experiment,  and  focuses  on  the  possible  difference  that  might  arise 
from  the  one  factor  of  interest. 


If the sample is large (meaning that $n \geq 30$) then in the formula for the confidence
interval we may replace $t_{\alpha/2}$ by $z_{\alpha/2}$. For hypothesis testing when the number of
pairs is at least 30, we may use the same statistic as for small samples,
except now it follows a standard normal distribution, so we use the last line
of Figure 12.3 "Critical Values of " to compute critical values, and p-values can be
computed exactly with Figure 12.2 "Cumulative Normal Probability", not merely
estimated using Figure 12.3 "Critical Values of".
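
For the large-sample case, the standard normal computations can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: the function names `paired_z_test` and `z_critical` are hypothetical, and the list `synthetic` is made-up data used only to show the call, not data from this book.

```python
import math
import statistics
from statistics import NormalDist

def z_critical(tail_area):
    """Value z with the given area to its right under the standard normal curve."""
    return NormalDist().inv_cdf(1 - tail_area)

def paired_z_test(diffs, d0=0.0, tail="two"):
    """Large-sample paired test; returns (z, p_value).

    tail is "left", "right", or "two", matching the form of the alternative.
    """
    n = len(diffs)
    if n < 30:
        raise ValueError("large-sample procedure assumes at least 30 pairs")
    z = (statistics.mean(diffs) - d0) / (statistics.stdev(diffs) / math.sqrt(n))
    std_normal = NormalDist()
    if tail == "left":
        p = std_normal.cdf(z)
    elif tail == "right":
        p = 1 - std_normal.cdf(z)
    else:
        p = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p

# z_{0.025}, the multiplier for a 95% confidence interval with n >= 30 pairs.
print(f"z_0.025 = {z_critical(0.025):.3f}")

# Made-up differences (30 values), purely to demonstrate the function call.
synthetic = [0.3, -0.1, 0.2, 0.4, 0.0, 0.1] * 5
z, p = paired_z_test(synthetic, d0=0.0, tail="right")
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

Unlike the t-table lookup, the normal cumulative distribution gives the p-value directly rather than a bracketing interval.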


KEY  TAKEAWAYS 


•  When  the  data  are  collected  in  pairs,  the  differences  computed  for  each 
pair  are  the  data  that  are  used  in  the  formulas. 

•  A  confidence  interval  for  the  difference  in  two  population  means  using 
paired  sampling  is  computed  using  a  formula  in  the  same  fashion  as  was 
done  for  a  single  population  mean. 

•  The  same  five-step  procedure  used  to  test  hypotheses  concerning  a 
single  population  mean  is  used  to  test  hypotheses  concerning  the 
difference between two population means using paired sampling. The only
difference is in the formula for the standardized test statistic.

EXERCISES

In all exercises for this section assume that the population of differences is
normal. 


1. Use the following paired sample data for this exercise.

Population 1    35  32  35  35  36  35  36
Population 2    28  26  27  26  29  27  29

a. Compute $\bar{d}$ and $s_d$.

b. Give a point estimate for $\mu_1 - \mu_2 = \mu_d$.

c. Construct the 95% confidence interval for $\mu_1 - \mu_2 = \mu_d$ from these
data.

d. Test, at the 10% level of significance, the hypothesis that $\mu_1 - \mu_2 > 7$ as
an alternative to the null hypothesis that $\mu_1 - \mu_2 = 7$.

2. Use the following paired sample data for this exercise.

Population 1    103  127   96  110   90  118  130  106
Population 2     81  106   73   88   70   95  109   83

a. Compute $\bar{d}$ and $s_d$.

b. Give a point estimate for $\mu_1 - \mu_2 = \mu_d$.

c. Construct the 90% confidence interval for $\mu_1 - \mu_2 = \mu_d$ from these
data.

d. Test, at the 1% level of significance, the hypothesis that $\mu_1 - \mu_2 < 24$
as an alternative to the null hypothesis that $\mu_1 - \mu_2 = 24$.

3. Use the following paired sample data for this exercise.

Population 1    40  27  55  34
Population 2    53  42  68  50

a. Compute $\bar{d}$ and $s_d$.

b. Give a point estimate for $\mu_1 - \mu_2 = \mu_d$.

c. Construct the 99% confidence interval for $\mu_1 - \mu_2 = \mu_d$ from these
data.

d. Test, at the 10% level of significance, the hypothesis that
$\mu_1 - \mu_2 < -12$ as an alternative to the null hypothesis that
$\mu_1 - \mu_2 = -12$.

4. Use the following paired sample data for this exercise.

Population 1    196  165  181  201  190
Population 2    212  182  199  210  205

a. Compute $\bar{d}$ and $s_d$.

b. Give a point estimate for $\mu_1 - \mu_2 = \mu_d$.

c. Construct the 98% confidence interval for $\mu_1 - \mu_2 = \mu_d$ from these
data.

d. Test, at the 2% level of significance, the hypothesis that $\mu_1 - \mu_2 \neq -20$
as an alternative to the null hypothesis that $\mu_1 - \mu_2 = -20$.


APPLICATIONS 


5.  Each  of  five  laboratory  mice  was  released  into  a  maze  twice.  The  five  pairs  of 
times  to  escape  were: 


Mouse             1     2     3     4     5
First release     129   89    136   163   118
Second release    113   97    139   85    75

a. Compute $\bar{d}$ and $s_d$.

b. Give a point estimate for $\mu_1 - \mu_2 = \mu_d$.

c. Construct the 90% confidence interval for $\mu_1 - \mu_2 = \mu_d$ from these
data.

d.  Test,  at  the  10%  level  of  significance,  the  hypothesis  that  it  takes  mice  less 
time  to  run  the  maze  on  the  second  trial,  on  average. 

6.  Eight  golfers  were  asked  to  submit  their  latest  scores  on  their  favorite  golf 
courses.  These  golfers  were  each  given  a  set  of  newly  designed  clubs.  After 
playing  with  the  new  clubs  for  a  few  months,  the  golfers  were  again  asked  to 
submit  their  latest  scores  on  the  same  golf  courses.  The  results  are 
summarized  below. 


Golfer       1    2    3    4    5    6    7    8
Own clubs    77   80   69   73   73   72   75   77
New clubs    72   81   68   73   75   70   73   75

a. Compute $\bar{d}$ and $s_d$.

b. Give a point estimate for $\mu_1 - \mu_2 = \mu_d$.

c. Construct the 99% confidence interval for $\mu_1 - \mu_2 = \mu_d$ from these
data.

d.  Test,  at  the  1%  level  of  significance,  the  hypothesis  that  on  average  golf 
scores  are  lower  with  the  new  clubs. 

7. A neighborhood home owners association suspects that the recent appraisal
values of the houses in the neighborhood conducted by the county government
for taxation purposes are too high. It hired a private company to appraise the
values of ten houses in the neighborhood. The results, in thousands of dollars,
are


House    County Government    Private Company
1        217                  219
2        350                  338
3        296                  291
4        237                  237
5        237                  235
6        272                  269
7        257                  239
8        277                  275
9        312                  320
10       335                  335

a. Give a point estimate for the difference between the mean private
appraisal of all such homes and the mean government appraisal of all such
homes.

b. Construct the 99% confidence interval based on these data for the
difference.

c. Test, at the 1% level of significance, the hypothesis that the mean appraised
value by the county government of all such houses is greater than the mean
appraised value by the private appraisal company.

8.  In  order  to  cut  costs  a  wine  producer  is  considering  using  duo  or  1  +  1  corks  in 
place of full natural wood corks, but is concerned that it could affect buyers’
perception of the quality of the wine. The wine producer shipped eight pairs of
bottles of its best young wines to eight wine experts. Each pair includes one
bottle with a natural wood cork and one with a duo cork. The experts are asked
to  rate  the  wines  on  a  one  to  ten  scale,  higher  numbers  corresponding  to 
higher  quality.  The  results  are: 


Wine Expert    Duo Cork    Wood Cork
1              8.5         8.5
2              8.0         8.5
3              6.5         8.0
4              7.5         8.5
5              8.0         7.5
6              8.0         8.0
7              9.0         9.0
8              7.0         7.5

a. Give a point estimate for the difference between the mean ratings of the
wine when bottles are sealed with different kinds of corks.

b.  Construct  the  90%  confidence  interval  based  on  these  data  for  the 
difference. 

c.  Test,  at  the  10%  level  of  significance,  the  hypothesis  that  on  the  average 
duo  corks  decrease  the  rating  of  the  wine. 

9.  Engineers  at  a  tire  manufacturing  corporation  wish  to  test  a  new  tire  material 
for  increased  durability.  To  test  the  tires  under  realistic  road  conditions,  new 
front  tires  are  mounted  on  each  of  11  company  cars,  one  tire  made  with  a 
production  material  and  the  other  with  the  experimental  material.  After  a 
fixed  period  the  11  pairs  were  measured  for  wear.  The  amount  of  wear  for  each 
tire  (in  mm)  is  shown  in  the  table: 


Car    Production    Experimental
1      5.1           5.0
2      6.5           6.5
3      3.6           3.1
4      3.5           3.7
5      5.7           4.5
6      5.0           4.1
7      6.4           5.3
8      4.7           2.6
9      3.2           3.0
10     3.5           3.5
11     6.4           5.1

a.  Give  a  point  estimate  for  the  difference  in  mean  wear. 

b.  Construct  the  99%  confidence  interval  for  the  difference  based  on  these 
data. 

c.  Test,  at  the  1%  level  of  significance,  the  hypothesis  that  the  mean  wear 
with  the  experimental  material  is  less  than  that  for  the  production 
material. 

10.  A  marriage  counselor  administered  a  test  designed  to  measure  overall 
contentment  to  30  randomly  selected  married  couples.  The  scores  for  each 
couple  are  given  below.  A  higher  number  corresponds  to  greater  contentment 
or  happiness. 


Couple   Husband   Wife
1        47        44
2        44        46
3        49        44
4        53        44
5        42        43
6        45        45
7        48        47
8        45        44
9        52        44
10       47        42
11       40        34
12       45        42
13       40        43
14       46        41
15       47        45
16       46        45
17       46        41
18       46        41
19       44        45
20       45        43
21       48        38
22       42        46
23       50        44
24       46        51
25       43        45
26       50        40
27       46        46
28       42        41
29       51        41
30       46        47

a.  Test,  at  the  1%  level  of  significance,  the  hypothesis  that  on  average  men 
and  women  are  not  equally  happy  in  marriage. 

b.  Test,  at  the  1%  level  of  significance,  the  hypothesis  that  on  average  men 
are  happier  than  women  in  marriage. 


LARGE  DATA  SET  EXERCISES 


11.  Large  Data  Set  5  lists  the  scores  for  25  randomly  selected  students  on  practice 
SAT  reading  tests  before  and  after  taking  a  two-week  SAT  preparation  course. 
Denote  the  population  of  all  students  who  have  taken  the  course  as  Population 
1  and  the  population  of  all  students  who  have  not  taken  the  course  as 
Population  2. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data5.xls

a. Compute the 25 differences in the order after - before, their mean $\bar{d}$,
and their sample standard deviation $s_d$.

b. Give a point estimate for $\mu_d = \mu_1 - \mu_2$, the difference in the mean score
of all students who have taken the course and the mean score of all who
have not.

c. Construct a 98% confidence interval for $\mu_d$.

d.  Test,  at  the  1%  level  of  significance,  the  hypothesis  that  the  mean  SAT 
score  increases  by  at  least  ten  points  by  taking  the  two-week  preparation 
course. 

12.  Large  Data  Set  12  lists  the  scores  on  one  round  for  75  randomly  selected 
members  at  a  golf  course,  first  using  their  own  original  clubs,  then  two  months 
later  after  using  new  clubs  with  an  experimental  design.  Denote  the  population 
of  all  golfers  using  their  own  original  clubs  as  Population  1  and  the  population 
of  all  golfers  using  the  new  style  clubs  as  Population  2. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data12.xls

a. Compute the 75 differences in the order original clubs - new clubs,
their mean $\bar{d}$, and their sample standard deviation $s_d$.

b. Give a point estimate for $\mu_d = \mu_1 - \mu_2$, the difference in the mean score
of all golfers using their original clubs and the mean score of all golfers
using the new kind of clubs.

c. Construct a 90% confidence interval for $\mu_d$.

d.  Test,  at  the  1%  level  of  significance,  the  hypothesis  that  the  mean  golf 
score  decreases  by  at  least  one  stroke  by  using  the  new  kind  of  clubs. 

13. Consider the previous problem again. Since the data set is so large, it is
reasonable to use the standard normal distribution instead of Student’s
t-distribution with 74 degrees of freedom.

a. Construct a 90% confidence interval for $\mu_d$ using the standard normal
distribution, meaning that the formula is $\bar{d} \pm z_{\alpha/2} \frac{s_d}{\sqrt{n}}$. (The
computations done in part (a) of the previous problem still apply and need
not be redone.) How does the result obtained here compare to the result
obtained in part (c) of the previous problem?

b. Test, at the 1% level of significance, the hypothesis that the mean golf
score decreases by at least one stroke by using the new kind of clubs, using
the standard normal distribution. (All the work done in part (d) of the
previous problem applies, except the critical value is now $z_\alpha$ instead of $t_\alpha$
(or the p-value can be computed exactly instead of only approximated, if
you used the p-value approach).) How does the result obtained here
compare to the result obtained in part (d) of the previous problem?

c. Construct the 99% confidence intervals for $\mu_d$ using both the t- and
z-distributions. How much difference is there in the results now?


ANSWERS


1. a. $\bar{d} = 7.4286$, $s_d = 0.9759$,
b. $\bar{d} = 7.4286$,
c. (6.53, 8.33),
d. $T = 1.162$, $df = 6$, $t_{0.10} = 1.44$, do not reject $H_0$

3. a. $\bar{d} = -14.25$, $s_d = 1.5$,
b. $\bar{d} = -14.25$,
c. (-18.63, -9.87),
d. $T = -3.000$, $df = 3$, $\pm t_{0.05} = \pm 2.353$, reject $H_0$

5. a. $\bar{d} = 25.2$, $s_d = 35.6609$,
b. 25.2,
c. $25.2 \pm 34.0$,
d. $T = 1.580$, $df = 4$, $t_{0.10} = 1.533$, reject $H_0$ (takes less time)

7. a. 3.2,
b. $3.2 \pm 7.5$,
c. $T = 1.392$, $df = 9$, $t_{0.01} = 2.821$, do not reject $H_0$ (government
appraisals not higher)

9. a. 0.65,
b. $0.65 \pm 0.69$,
c. $T = 3.014$, $df = 10$, $t_{0.01} = 2.764$, reject $H_0$ (experimental material
wears less)

11. a. $\bar{d} = 16.68$ and $s_d = 10.77$
b. $\bar{d} = 16.68$
c. (11.31, 22.05)
d. $H_0: \mu_1 - \mu_2 = 10$ vs. $H_a: \mu_1 - \mu_2 > 10$. Test Statistic: $T = 3.1014$.
d.f. = 24. Rejection Region: $[2.492, \infty)$. Decision: Reject $H_0$.

13. a. (1.6266, 2.6401). Endpoints change in the third decimal place.
b. $H_0: \mu_1 - \mu_2 = 1$ vs. $H_a: \mu_1 - \mu_2 > 1$. Test Statistic: $Z = 3.6791$.
Rejection Region: $[2.33, \infty)$. Decision: Reject $H_0$. The decision is the
same as in the previous problem.
c. Using the t-distribution, (1.3188, 2.9478). Using the
z-distribution, (1.3401, 2.9266). There is a difference.


9.4 Comparison of Two Population Proportions


LEARNING  OBJECTIVES 

1.  To  learn  how  to  construct  a  confidence  interval  for  the  difference  in  the 
proportions  of  two  distinct  populations  that  have  a  particular 
characteristic  of  interest. 

2.  To  learn  how  to  perform  a  test  of  hypotheses  concerning  the  difference 
in  the  proportions  of  two  distinct  populations  that  have  a  particular 
characteristic  of  interest. 


Suppose  we  wish  to  compare  the  proportions  of  two  populations  that  have  a  specific 
characteristic,  such  as  the  proportion  of  men  who  are  left-handed  compared  to  the 
proportion  of  women  who  are  left-handed.  Figure  9.7  "Independent  Sampling  from 
Two  Populations  In  Order  to  Compare  Proportions"  illustrates  the  conceptual 
framework  of  our  investigation.  Each  population  is  divided  into  two  groups,  the 
group  of  elements  that  have  the  characteristic  of  interest  (for  example,  being  left- 
handed)  and  the  group  of  elements  that  do  not.  We  arbitrarily  label  one  population 
as  Population  1  and  the  other  as  Population  2,  and  subscript  the  proportion  of  each 
population  that  possesses  the  characteristic  with  the  number  1  or  2  to  tell  them 
apart.  We  draw  a  random  sample  from  Population  1  and  label  the  sample  statistic  it 
yields with the subscript 1. Without reference to the first sample we draw a sample
from Population 2 and label its sample statistic with the subscript 2.

Figure 9.7 Independent Sampling from Two Populations In Order to Compare Proportions

[Figure: Population 1 (proportion $p_1$) yields Sample 1 (size $n_1$, successes $x_1$, sample proportion $\hat{p}_1$); Population 2 (proportion $p_2$) yields Sample 2 (size $n_2$, successes $x_2$, sample proportion $\hat{p}_2$).]

Our goal is to use the information in the samples to estimate the difference $p_1 - p_2$
in the two population proportions and to make statistically valid inferences about it.

Confidence  Intervals 

Since the sample proportion $\hat{p}_1$ computed using the sample drawn from Population
1 is a good estimator of population proportion $p_1$ of Population 1 and the sample
proportion $\hat{p}_2$ computed using the sample drawn from Population 2 is a good
estimator of population proportion $p_2$ of Population 2, a reasonable point estimate
of the difference $p_1 - p_2$ is $\hat{p}_1 - \hat{p}_2$. In order to widen this point estimate into a
confidence interval we suppose that both samples are large, as described in Section
7.3 "Large Sample Estimation of a Population Proportion" in Chapter 7 "Estimation"
and repeated below. If so, then the following formula for a confidence interval for
$p_1 - p_2$ is valid.

$100(1 - \alpha)\%$ Confidence Interval for the Difference
Between Two Population Proportions

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

The samples must be independent, and each sample must be large: each of the
intervals

$$\left[\hat{p}_1 - 3\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}},\ \hat{p}_1 + 3\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}}\right]$$

and

$$\left[\hat{p}_2 - 3\sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}},\ \hat{p}_2 + 3\sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\right]$$

must lie wholly within the interval $[0, 1]$.


EXAMPLE 10


The  department  of  code  enforcement  of  a  county  government  issues  permits 
to  general  contractors  to  work  on  residential  projects.  For  each  permit 
issued,  the  department  inspects  the  result  of  the  project  and  gives  a  “pass” 
or  “fail”  rating.  A  failed  project  must  be  re-inspected  until  it  receives  a  pass 
rating.  The  department  had  been  frustrated  by  the  high  cost  of  re-inspection 
and  decided  to  publish  the  inspection  records  of  all  contractors  on  the  web. 

It  was  hoped  that  public  access  to  the  records  would  lower  the  re-inspection 
rate.  A  year  after  the  web  access  was  made  public,  two  samples  of  records 
were  randomly  selected.  One  sample  was  selected  from  the  pool  of  records 
before  the  web  publication  and  one  after.  The  proportion  of  projects  that 
passed  on  the  first  inspection  was  noted  for  each  sample.  The  results  are 
summarized  below.  Construct  a  point  estimate  and  a  90%  confidence 
interval  for  the  difference  in  the  passing  rate  on  first  inspection  between  the 
two  time  periods. 


No public web access: $n_1 = 500$, $\hat{p}_1 = 0.67$
Public web access: $n_2 = 100$, $\hat{p}_2 = 0.80$

Solution:

The point estimate of $p_1 - p_2$ is

$$\hat{p}_1 - \hat{p}_2 = 0.67 - 0.80 = -0.13$$

Because the “No public web access” population was labeled as Population 1
and the “Public web access” population was labeled as Population 2, in
words this means that we estimate that the proportion of projects that
passed on the first inspection increased by 13 percentage points after
records were posted on the web.

The sample sizes are sufficiently large for constructing a confidence interval
since for sample 1:

$$3\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}} = 3\sqrt{\frac{(0.67)(0.33)}{500}} = 0.06$$

so that

$$\left[\hat{p}_1 - 3\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}},\ \hat{p}_1 + 3\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}}\right] = [0.67 - 0.06,\ 0.67 + 0.06] = [0.61, 0.73] \subset [0, 1]$$

and for sample 2:

$$3\sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = 3\sqrt{\frac{(0.8)(0.2)}{100}} = 0.12$$

so that

$$\left[\hat{p}_2 - 3\sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}},\ \hat{p}_2 + 3\sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\right] = [0.8 - 0.12,\ 0.8 + 0.12] = [0.68, 0.92] \subset [0, 1]$$

To apply the formula for the confidence interval, we first observe that the
90% confidence level means that $\alpha = 1 - 0.90 = 0.10$ so that
$z_{\alpha/2} = z_{0.05}$. From Figure 12.3 "Critical Values of" we read directly that
$z_{0.05} = 1.645$. Thus the desired confidence interval is

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = -0.13 \pm 1.645 \sqrt{\frac{(0.67)(0.33)}{500} + \frac{(0.8)(0.2)}{100}} = -0.13 \pm 0.07$$

The 90% confidence interval is $[-0.20, -0.06]$. We are 90% confident
that the difference in the population proportions lies in the interval
$[-0.20, -0.06]$, in the sense that in repeated sampling 90% of all
intervals constructed from the sample data in this manner will contain
$p_1 - p_2$. Taking into account the labeling of the two populations, this
means that we are 90% confident that the proportion of projects that pass on
the first inspection is between 6 and 20 percentage points higher after
public access to the records than before.
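
The computation in Example 10 can be reproduced with a short script. This is only a minimal sketch under stated assumptions: the function names `is_large_sample` and `two_proportion_ci` are hypothetical and chosen for illustration, and the critical value is taken from Python's standard library rather than read from a table, so the last digits may differ slightly from hand calculations.

```python
from math import sqrt
from statistics import NormalDist


def is_large_sample(p_hat, n):
    """Return True if p_hat +/- 3*sqrt(p_hat*(1-p_hat)/n) lies within [0, 1]."""
    margin = 3 * sqrt(p_hat * (1 - p_hat) / n)
    return 0 <= p_hat - margin and p_hat + margin <= 1


def two_proportion_ci(p1_hat, n1, p2_hat, n2, confidence):
    """Large-sample confidence interval for p1 - p2 (independent samples)."""
    if not (is_large_sample(p1_hat, n1) and is_large_sample(p2_hat, n2)):
        raise ValueError("samples are not large enough for this formula")
    alpha = 1 - confidence
    # Critical value z_{alpha/2}: cuts off a right tail of area alpha/2.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se


# Data of Example 10: Population 1 = no public web access, Population 2 = public web access.
low, high = two_proportion_ci(0.67, 500, 0.80, 100, 0.90)
print(f"[{low:.2f}, {high:.2f}]")  # prints "[-0.20, -0.06]"
```

Checking the sample-size condition inside the function keeps the formula from being applied silently to samples for which it is not valid.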


Hypothesis  Testing 

In hypothesis tests concerning the relative sizes of the proportions $p_1$ and $p_2$ of two
populations that possess a particular characteristic, the null and alternative
hypotheses will always be expressed in terms of the difference of the two
population proportions. Hence the null hypothesis is always written

$$H_0: p_1 - p_2 = D_0$$

The three forms of the alternative hypothesis, with the terminology for each case,
are:

Form of $H_a$               Terminology
$H_a: p_1 - p_2 < D_0$      Left-tailed
$H_a: p_1 - p_2 > D_0$      Right-tailed
$H_a: p_1 - p_2 \neq D_0$   Two-tailed

As long as the samples are independent and both are large the following formula for
the standardized test statistic is valid, and it has the standard normal distribution.

Standardized Test Statistic for Hypothesis Tests
Concerning the Difference Between Two Population
Proportions

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$$

The test statistic has the standard normal distribution.

The samples must be independent, and each sample must be large: each of the
intervals

$$\left[\hat{p}_1 - 3\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}},\ \hat{p}_1 + 3\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}}\right]$$

and

$$\left[\hat{p}_2 - 3\sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}},\ \hat{p}_2 + 3\sqrt{\frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\right]$$

must lie wholly within the interval $[0, 1]$.


EXAMPLE 11


Using  the  data  of  Note  9.25  "Example  10",  test  whether  there  is  sufficient 
evidence  to  conclude  that  public  web  access  to  the  inspection  records  has 
increased  the  proportion  of  projects  that  passed  on  the  first  inspection  by 
more  than  5  percentage  points.  Use  the  critical  value  approach  at  the  10% 
level  of  significance. 

Solution: 

• Step 1. Taking into account the labeling of the populations, an
increase in passing rate at the first inspection by more than 5
percentage points after public access on the web may be
expressed as $p_2 > p_1 + 0.05$, which by algebra is the same as
$p_1 - p_2 < -0.05$. This is the alternative hypothesis. Since
the null hypothesis is always expressed as an equality, with the
same number on the right as is in the alternative hypothesis, the
test is

$$H_0: p_1 - p_2 = -0.05 \quad \text{vs.} \quad H_a: p_1 - p_2 < -0.05 \quad @\ \alpha = 0.10$$

• Step 2. Since the test is with respect to a difference in population
proportions the test statistic is

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$$

• Step 3. Inserting the values given in Note 9.25 "Example 10" and
the value $D_0 = -0.05$ into the formula for the test statistic
gives

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}} = \frac{(-0.13) - (-0.05)}{\sqrt{\frac{(0.67)(0.33)}{500} + \frac{(0.8)(0.2)}{100}}} = -1.770$$

• Step 4. Since the symbol in $H_a$ is “<” this is a left-tailed test, so there is a
single critical value, $-z_\alpha = -z_{0.10}$. From the last row in Figure 12.3
"Critical Values of" $z_{0.10} = 1.282$, so $-z_{0.10} = -1.282$. The
rejection region is $(-\infty, -1.282]$.


•  Step  5.  As  shown  in  Figure  9.8  "Rejection  Region  and  Test 
Statistic  for  "  the  test  statistic  falls  in  the  rejection  region.  The 
decision  is  to  reject  Ho.  In  the  context  of  the  problem  our 
conclusion  is: 


The  data  provide  sufficient  evidence,  at  the  10%  level  of 
significance,  to  conclude  that  the  rate  of  passing  on  the  first 
inspection  has  increased  by  more  than  5  percentage  points  since 
records  were  publicly  posted  on  the  web. 


Figure 9.8 Rejection Region and Test Statistic for Note 9.27 "Example 11"

[Figure: left-tailed rejection region for $H_a: p_1 - p_2 < -0.05$, with the test statistic $Z = -1.770$ marked inside the region labeled "Reject $H_0$".]


EXAMPLE 12


Perform  the  test  of  Note  9.27  "Example  11"  using  the  p-value  approach. 

Solution: 

The  first  three  steps  are  identical  to  those  in  Note  9.27  "Example  11". 

• Step 4. Because the test is left-tailed the observed significance or p-value
of the test is just the area of the left tail of the standard normal
distribution that is cut off by the test statistic $Z = -1.770$. From
Figure 12.2 "Cumulative Normal Probability" the area of the left tail
determined by -1.77 is 0.0384. The p-value is 0.0384.

• Step 5. Since the p-value 0.0384 is less than $\alpha = 0.10$, the decision is to
reject the null hypothesis: The data provide sufficient evidence, at the
10% level of significance, to conclude that the rate of passing on the first
inspection has increased by more than 5 percentage points since records
were publicly posted on the web.
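
The test carried out in Examples 11 and 12 can also be sketched in code. The sketch below is illustrative only: the function name `two_proportion_z` is hypothetical, and the critical value and p-value come from Python's standard library instead of the printed tables, so the p-value is computed from the unrounded test statistic and may differ from the table value 0.0384 in the last digit.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1_hat, n1, p2_hat, n2, d0):
    """Standardized test statistic for H0: p1 - p2 = d0 (independent, large samples)."""
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return ((p1_hat - p2_hat) - d0) / se


alpha = 0.10
z = two_proportion_z(0.67, 500, 0.80, 100, -0.05)

# Critical value approach (left-tailed): reject H0 when z <= -z_alpha.
critical_value = NormalDist().inv_cdf(alpha)  # equals -z_alpha
reject_by_critical_value = z <= critical_value

# p-value approach (left-tailed): area of the left tail cut off by z.
p_value = NormalDist().cdf(z)
reject_by_p_value = p_value < alpha

print(f"Z = {z:.3f}")  # prints "Z = -1.770"
print(f"critical value = {critical_value:.3f}, p-value = {p_value:.4f}")
print("reject H0" if reject_by_p_value else "do not reject H0")
```

Both approaches lead to the same decision here, as they must, since they are two ways of comparing the same test statistic with the same level of significance.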


Finally  a  common  misuse  of  the  formulas  given  in  this  section  must  be  mentioned. 
Suppose  a  large  pre-election  survey  of  potential  voters  is  conducted.  Each  person 
surveyed  is  asked  to  express  a  preference  between,  say,  Candidate  A  and  Candidate 
B.  (Perhaps  “no  preference”  or  “other”  are  also  choices,  but  that  is  not  important.) 
In such a survey, estimators $\hat{p}_A$ and $\hat{p}_B$ of $p_A$ and $p_B$ can be calculated. It is
important to realize, however, that these two estimators were not calculated from
two independent samples. While $\hat{p}_A - \hat{p}_B$ may be a reasonable estimator of
$p_A - p_B$, the formulas for confidence intervals and for the standardized test
statistic given in this section are not valid for data obtained in this manner.
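
To see why the formulas fail in this situation, the following sketch compares two standard errors for $\hat{p}_A - \hat{p}_B$. It rests on assumptions that are not part of the text: each of the $n$ respondents in a single sample chooses exactly one option (a multinomial model), in which case the two estimators are negatively correlated and the variance of their difference is $[p_A + p_B - (p_A - p_B)^2]/n$. The function names and the survey numbers are hypothetical.

```python
from math import sqrt


def se_independent_formula(pa_hat, pb_hat, n):
    """Standard error from this section's formula, wrongly treating the
    two proportions as if they came from independent samples of size n."""
    return sqrt(pa_hat * (1 - pa_hat) / n + pb_hat * (1 - pb_hat) / n)


def se_same_sample(pa_hat, pb_hat, n):
    """Standard error of pa_hat - pb_hat when both proportions come from
    one sample in which each respondent picks exactly one option
    (multinomial model, an assumption of this sketch)."""
    return sqrt((pa_hat + pb_hat - (pa_hat - pb_hat) ** 2) / n)


# Hypothetical survey: 1000 voters, 48% prefer A, 44% prefer B, 8% other.
naive = se_independent_formula(0.48, 0.44, 1000)
correct = se_same_sample(0.48, 0.44, 1000)
print(f"independent-samples formula: {naive:.4f}")
print(f"same-sample (multinomial):   {correct:.4f}")
```

In this hypothetical case the independent-samples formula gives a noticeably smaller standard error, so an interval built from it would be too narrow.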


KEY  TAKEAWAYS 


•  A  confidence  interval  for  the  difference  in  two  population  proportions  is 
computed  using  a  formula  in  the  same  fashion  as  was  done  for  a  single 
population  mean. 

•  The  same  five-step  procedure  used  to  test  hypotheses  concerning  a 
single  population  proportion  is  used  to  test  hypotheses  concerning  the 
difference  between  two  population  proportions.  The  only  difference  is 
in the formula for the standardized test statistic.


1. Construct the confidence interval for $p_1 - p_2$ for the level of confidence and
the data given. (The samples are sufficiently large.)

a. 90% confidence,
$n_1 = 1670$, $\hat{p}_1 = 0.42$
$n_2 = 900$, $\hat{p}_2 = 0.38$

b. 95% confidence,
$n_1 = 600$, $\hat{p}_1 = 0.84$
$n_2 = 420$, $\hat{p}_2 = 0.67$

2. Construct the confidence interval for $p_1 - p_2$ for the level of confidence and
the data given. (The samples are sufficiently large.)

a. 98% confidence,
$n_1 = 750$, $\hat{p}_1 = 0.64$
$n_2 = 800$, $\hat{p}_2 = 0.51$

b. 99.5% confidence,
$n_1 = 250$, $\hat{p}_1 = 0.78$
$n_2 = 250$, $\hat{p}_2 = 0.51$

3. Construct the confidence interval for $p_1 - p_2$ for the level of confidence and
the data given. (The samples are sufficiently large.)

a. 80% confidence,
$n_1 = 300$, $\hat{p}_1 = 0.255$
$n_2 = 400$, $\hat{p}_2 = 0.193$

b. 95% confidence,
$n_1 = 3500$, $\hat{p}_1 = 0.147$
$n_2 = 3750$, $\hat{p}_2 = 0.131$

4. Construct the confidence interval for $p_1 - p_2$ for the level of confidence and
the data given. (The samples are sufficiently large.)

a. 99% confidence,
$n_1 = 2250$, $\hat{p}_1 = 0.915$
$n_2 = 2525$, $\hat{p}_2 = 0.858$

b. 95% confidence,
$n_1 = 120$, $\hat{p}_1 = 0.650$
$n_2 = 200$, $\hat{p}_2 = 0.505$

5. Perform the test of hypotheses indicated, using the data given. Use the critical
value approach. Compute the p-value of the test as well. (The samples are
sufficiently large.)

a. Test $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 > 0$ @ $\alpha = 0.10$,
$n_1 = 1200$, $\hat{p}_1 = 0.42$
$n_2 = 1200$, $\hat{p}_2 = 0.40$

b. Test $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 \neq 0$ @ $\alpha = 0.05$,
$n_1 = 550$, $\hat{p}_1 = 0.61$
$n_2 = 600$, $\hat{p}_2 = 0.67$

6. Perform the test of hypotheses indicated, using the data given. Use the critical
value approach. Compute the p-value of the test as well. (The samples are
sufficiently large.)

a. Test $H_0: p_1 - p_2 = 0.05$ vs. $H_a: p_1 - p_2 > 0.05$ @ $\alpha = 0.05$,
$n_1 = 1100$, $\hat{p}_1 = 0.57$
$n_2 = 1100$, $\hat{p}_2 = 0.48$

b. Test $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 \neq 0$ @ $\alpha = 0.05$,
$n_1 = 800$, $\hat{p}_1 = 0.39$
$n_2 = 900$, $\hat{p}_2 = 0.43$

7. Perform the test of hypotheses indicated, using the data given. Use the critical
value approach. Compute the p-value of the test as well. (The samples are
sufficiently large.)

a. Test $H_0: p_1 - p_2 = 0.25$ vs. $H_a: p_1 - p_2 < 0.25$ @ $\alpha = 0.005$,
$n_1 = 1400$, $\hat{p}_1 = 0.57$
$n_2 = 1200$, $\hat{p}_2 = 0.37$

b. Test $H_0: p_1 - p_2 = 0.16$ vs. $H_a: p_1 - p_2 \neq 0.16$ @ $\alpha = 0.02$,
$n_1 = 750$, $\hat{p}_1 = 0.43$
$n_2 = 600$, $\hat{p}_2 = 0.22$

8. Perform the test of hypotheses indicated, using the data given. Use the critical
value approach. Compute the p-value of the test as well. (The samples are
sufficiently large.)

a. Test $H_0: p_1 - p_2 = 0.08$ vs. $H_a: p_1 - p_2 > 0.08$ @ $\alpha = 0.025$,
$n_1 = 450$, $\hat{p}_1 = 0.67$
$n_2 = 200$, $\hat{p}_2 = 0.52$

b. Test $H_0: p_1 - p_2 = 0.02$ vs. $H_a: p_1 - p_2 \neq 0.02$ @ $\alpha = 0.001$,
$n_1 = 2700$, $\hat{p}_1 = 0.837$
$n_2 = 2900$, $\hat{p}_2 = 0.854$

9. Perform the test of hypotheses indicated, using the data given. Use the p-value
approach. (The samples are sufficiently large.)

a. Test $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 < 0$ @ $\alpha = 0.005$,
$n_1 = 1100$, $\hat{p}_1 = 0.22$
$n_2 = 1300$, $\hat{p}_2 = 0.27$

b. Test $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 \neq 0$ @ $\alpha = 0.01$,
$n_1 = 650$, $\hat{p}_1 = 0.35$
$n_2 = 650$, $\hat{p}_2 = 0.41$

10. Perform the test of hypotheses indicated, using the data given. Use the p-value
approach. (The samples are sufficiently large.)

a. Test $H_0: p_1 - p_2 = 0.15$ vs. $H_a: p_1 - p_2 > 0.15$ @ $\alpha = 0.10$,
$n_1 = 950$, $\hat{p}_1 = 0.41$
$n_2 = 500$, $\hat{p}_2 = 0.23$

b. Test $H_0: p_1 - p_2 = 0.10$ vs. $H_a: p_1 - p_2 \neq 0.10$ @ $\alpha = 0.10$,
$n_1 = 220$, $\hat{p}_1 = 0.92$
$n_2 = 160$, $\hat{p}_2 = 0.78$

11. Perform the test of hypotheses indicated, using the data given. Use the p-value
approach. (The samples are sufficiently large.)

a. Test $H_0: p_1 - p_2 = 0.22$ vs. $H_a: p_1 - p_2 > 0.22$ @ $\alpha = 0.05$,
$n_1 = 90$, $\hat{p}_1 = 0.72$
$n_2 = 75$, $\hat{p}_2 = 0.40$

b. Test $H_0: p_1 - p_2 = 0.37$ vs. $H_a: p_1 - p_2 \neq 0.37$ @ $\alpha = 0.02$,
$n_1 = 425$, $\hat{p}_1 = 0.772$
$n_2 = 425$, $\hat{p}_2 = 0.331$

12. Perform the test of hypotheses indicated, using the data given. Use the p-value
approach. (The samples are sufficiently large.)

a. Test $H_0: p_1 - p_2 = 0.50$ vs. $H_a: p_1 - p_2 < 0.50$ @ $\alpha = 0.10$,
$n_1 = 40$, $\hat{p}_1 = 0.65$
$n_2 = 55$, $\hat{p}_2 = 0.24$

b. Test $H_0: p_1 - p_2 = 0.30$ vs. $H_a: p_1 - p_2 \neq 0.30$ @ $\alpha = 0.10$,
$n_1 = 7500$, $\hat{p}_1 = 0.664$
$n_2 = 1000$, $\hat{p}_2 = 0.319$

APPLICATIONS 


In all the remaining exercises the samples are sufficiently large (so this
need not be checked).

13.  Voters  in  a  particular  city  who  identify  themselves  with  one  or  the  other  of 
two  political  parties  were  randomly  selected  and  asked  if  they  favor  a  proposal 
to  allow  citizens  with  proper  license  to  carry  a  concealed  handgun  in  city 
parks.  The  results  are: 


                     Party A   Party B
Sample size, n       150       200
Number in favor, x   90        140

a.  Give  a  point  estimate  for  the  difference  in  the  proportion  of  all  members  of 
Party  A  and  all  members  of  Party  B  who  favor  the  proposal. 

b.  Construct  the  95%  confidence  interval  for  the  difference,  based  on  these 
data. 

c.  Test,  at  the  5%  level  of  significance,  the  hypothesis  that  the  proportion  of 
all  members  of  Party  A  who  favor  the  proposal  is  less  than  the  proportion 
of  all  members  of  Party  B  who  do. 

d.  Compute  the  p-value  of  the  test. 

14.  To  investigate  a  possible  relation  between  gender  and  handedness,  a  random 
sample  of  320  adults  was  taken,  with  the  following  results: 


                           Men   Women
Sample size, n             168   152
Number of left-handed, x   24    9

a. Give a point estimate for the difference in the proportion of all men who
are left-handed and the proportion of all women who are left-handed.

b. Construct the 95% confidence interval for the difference, based on these
data.

c. Test, at the 5% level of significance, the hypothesis that the proportion of
men who are left-handed is greater than the proportion of women who
are.

d.  Compute  the  p-value  of  the  test. 

15.  A  local  school  board  member  randomly  sampled  private  and  public  high  school 
teachers  in  his  district  to  compare  the  proportions  of  National  Board  Certified 
(NBC)  teachers  in  the  faculty.  The  results  were: 


                                         Private Schools   Public Schools
Sample size, n                           80                520
Proportion of NBC teachers, $\hat{p}$    0.175             0.150

a.  Give  a  point  estimate  for  the  difference  in  the  proportion  of  all  teachers  in 
area  public  schools  and  the  proportion  of  all  teachers  in  private  schools 
who  are  National  Board  Certified. 

b.  Construct  the  90%  confidence  interval  for  the  difference,  based  on  these 
data. 

c.  Test,  at  the  10%  level  of  significance,  the  hypothesis  that  the  proportion  of 
all  public  school  teachers  who  are  National  Board  Certified  is  less  than  the 
proportion  of  private  school  teachers  who  are. 

d.  Compute  the  p-value  of  the  test. 

16.  In  professional  basketball  games,  the  fans  of  the  home  team  always  try  to 
distract  free  throw  shooters  on  the  visiting  team.  To  investigate  whether  this 
tactic  is  actually  effective,  the  free  throw  statistics  of  a  professional  basketball 
player  with  a  high  free  throw  percentage  were  examined.  During  the  entire 
last  season,  this  player  had  656  free  throws,  420  in  home  games  and  236  in 
away  games.  The  results  are  summarized  below. 


                                 Home    Away
Sample size, n                   420     236
Free throw percent, $\hat{p}$    81.5%   78.8%

a.  Give  a  point  estimate  for  the  difference  in  the  proportion  of  free  throws 
made  at  home  and  away. 

b.  Construct  the  90%  confidence  interval  for  the  difference,  based  on  these 
data. 

c.  Test,  at  the  10%  level  of  significance,  the  hypothesis  that  there  exists  a 
home  advantage  in  free  throws. 

d. Compute the p-value of the test.

17. Randomly selected middle-aged people in both China and the United States
were asked if they believed that adults have an obligation to financially
support their aged parents. The results are summarized below.

                   China   USA
Sample size, n     1300    150
Number of yes, x   1170    110

Test,  at  the  1%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  there  exists  a  cultural  difference  in  attitude 
regarding  this  question. 

18.  A  manufacturer  of  walk-behind  push  mowers  receives  refurbished  small 

engines  from  two  new  suppliers,  A  and  B.  It  is  not  uncommon  that  some  of  the 
refurbished  engines  need  to  be  lightly  serviced  before  they  can  be  fitted  into 
mowers.  The  mower  manufacturer  recently  received  100  engines  from  each 
supplier.  In  the  shipment  from  A,  13  needed  further  service.  In  the  shipment 
from  B,  10  needed  further  service.  Test,  at  the  10%  level  of  significance, 
whether  the  data  provide  sufficient  evidence  to  conclude  that  there  exists  a 
difference  in  the  proportions  of  engines  from  the  two  suppliers  needing 
service. 


LARGE  DATA  SET  EXERCISES 


19.  Large  Data  Sets  6A  and  6B  record  results  of  a  random  survey  of  200  voters  in 
each  of  two  regions,  in  which  they  were  asked  to  express  whether  they  prefer 
Candidate  A  for  a  U.S.  Senate  seat  or  prefer  some  other  candidate.  Let  the 
population  of  all  voters  in  region  1  be  denoted  Population  1  and  the  population 
of  all  voters  in  region  2  be  denoted  Population  2.  Let  pi  be  the  proportion  of 
voters  in  Population  1  who  prefer  Candidate  A,  and  p2  the  proportion  in 
Population  2  who  do. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data6A.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data6B.xls

a. Find the relevant sample proportions $\hat{p}_1$ and $\hat{p}_2$.

b. Construct a point estimate for $p_1 - p_2$.

c. Construct a 95% confidence interval for $p_1 - p_2$.

d.  Test,  at  the  5%  level  of  significance,  the  hypothesis  that  the  same 
proportion  of  voters  in  the  two  regions  favor  Candidate  A,  against  the 
alternative  that  a  larger  proportion  in  Population  2  do. 
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20.  Large  Data  Set  11  records  the  results  of  samples  of  real  estate  sales  in  a  certain 
region  in  the  year  2008  (lines  2  through  536)  and  in  the  year  2010  (lines  537 
through  1106).  Foreclosure  sales  are  identified  with  a  1  in  the  second  column. 
Let  all  real  estate  sales  in  the  region  in  2008  be  Population  1  and  all  real  estate 
sales  in  the  region  in  2010  be  Population  2. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data11.xls

a. Use the sample data to construct point estimates $\hat{p}_1$ and $\hat{p}_2$ of the
proportions $p_1$ and $p_2$ of all real estate sales in this region in 2008 and 2010
that were foreclosure sales. Construct a point estimate of $p_1 - p_2$.

b. Use the sample data to construct a 90% confidence interval for $p_1 - p_2$.

c.  Test,  at  the  10%  level  of  significance,  the  hypothesis  that  the  proportion  of 
real  estate  sales  in  the  region  in  2010  that  were  foreclosure  sales  was 
greater  than  the  proportion  of  real  estate  sales  in  the  region  in  2008  that 
were  foreclosure  sales.  (The  default  is  that  the  proportions  were  the  same.) 
ANSWERS


1. a. (0.0068, 0.0732),

b. (0.1163, 0.2237)

3. a. (0.0210, 0.1030),

b. (0.0001, 0.0319)

5. a. $Z = 0.996$, $z_{0.10} = 1.282$, p-value = 0.1587, do not reject $H_0$,

b. $Z = -2.120$, $\pm z_{0.025} = \pm 1.960$, p-value = 0.0340, reject $H_0$

7. a. $Z = -2.602$, $-z_{0.005} = -2.576$, p-value = 0.0047, reject $H_0$,

b. $Z = 2.020$, $\pm z_{0.01} = \pm 2.326$, p-value = 0.0434, do not reject $H_0$

9. a. $Z = -2.85$, p-value = 0.0022, reject $H_0$,

b. $Z = -2.23$, p-value = 0.0258, do not reject $H_0$

11. a. $Z = 1.36$, p-value = 0.0869, do not reject $H_0$,

b. $Z = 2.32$, p-value = 0.0204, do not reject $H_0$

13. a. −0.10,

b. −0.10 ± 0.101,

c. $Z = -1.943$, $-z_{0.05} = -1.645$, reject $H_0$ (fewer in Party A favor),

d. p-value = 0.0262

15. a. 0.025,

b. 0.025 ± 0.0745,

c. $Z = 0.552$, $z_{0.10} = 1.282$, do not reject $H_0$ (as many public school
teachers are certified),

d. p-value = 0.2912

17. $Z = 4.498$, $\pm z_{0.005} = \pm 2.576$, reject $H_0$ (different)

19. a. $\hat{p}_1 = 0.355$ and $\hat{p}_2 = 0.41$

b. $\hat{p}_1 - \hat{p}_2 = -0.055$

c. (−0.1501, 0.0401)

d. $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 < 0$. Test Statistic:
$Z = -1.1335$. Rejection Region: $(-\infty, -1.645]$. Decision: Fail to
reject $H_0$.




9.5  Sample  Size  Considerations 


LEARNING  OBJECTIVE 

1. To learn how to apply formulas for estimating the sample sizes that will
be needed in order to construct a confidence interval for the difference
in two population means or proportions that meets given criteria.


As  was  pointed  out  at  the  beginning  of  Section  7.4  "Sample  Size  Considerations"  in 
Chapter  7  "Estimation",  sampling  is  typically  done  with  definite  objectives  in  mind. 
For example, a physician might wish to estimate the difference between the average
amount of sleep gotten by patients suffering a certain condition and the average
amount of sleep gotten by healthy adults, at 90% confidence and to within half an hour.
Since  sampling  costs  time,  effort,  and  money,  it  would  be  useful  to  be  able  to 
estimate  the  smallest  size  samples  that  are  likely  to  meet  these  criteria. 

Estimating $\mu_1 - \mu_2$ with Independent Samples

Assuming that large samples will be required, the confidence interval formula for
estimating the difference $\mu_1 - \mu_2$ between two population means using
independent samples is $(\bar{x}_1 - \bar{x}_2) \pm E$, where

$$E = z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

To say that we wish to estimate the difference in means to within a certain number of
units means that we want the margin of error $E$ to be no larger than that number. The
number $z_{\alpha/2}$ is determined by the desired level of confidence.


The numbers $s_1$ and $s_2$ are estimates of the standard deviations $\sigma_1$ and $\sigma_2$ of the two
populations. In analogy with what we did in Section 7.4 "Sample Size
Considerations" in Chapter 7 "Estimation" we will assume that we either know or
can reasonably approximate $\sigma_1$ and $\sigma_2$.


We cannot solve for both $n_1$ and $n_2$, so we have to make an assumption about their
relative sizes. We will specify that they be equal. With these assumptions we obtain
the minimum sample sizes needed by solving the equation displayed just above for
$n_1 = n_2$.


Minimum  Equal  Sample  Sizes  for  Estimating  the 
Difference  in  the  Means  of  Two  Populations  Using 
Independent  Samples 


The estimated minimum equal sample sizes $n_1 = n_2$ needed to estimate the
difference $\mu_1 - \mu_2$ in two population means to within $E$ units at $100(1 - \alpha)\%$
confidence is

$$n_1 = n_2 = \frac{(z_{\alpha/2})^2 (\sigma_1^2 + \sigma_2^2)}{E^2} \quad \text{(rounded up)}$$

In all the examples and exercises the population standard deviations $\sigma_1$ and $\sigma_2$ will
be given.
EXAMPLE 13


A law firm wishes to estimate the difference in the mean delivery time of
documents sent between two of its offices by two different courier
companies, to within half an hour and with 99.5% confidence. From their
records it will randomly sample the same number n of documents as
delivered by each courier company. Determine how large n must be if the
estimated standard deviations of the delivery times are 0.75 hour for one
company and 1.15 hours for the other.

Solution:


Confidence level 99.5% means that $\alpha = 1 - 0.995 = 0.005$ so
$\alpha/2 = 0.0025$. From the last line of Figure 12.3 "Critical Values of $t$" we
obtain $z_{0.0025} = 2.807$.


To say that the estimate is to be “to within half an hour” means that $E = 0.5$.
Thus

$$n = \frac{(z_{\alpha/2})^2 (\sigma_1^2 + \sigma_2^2)}{E^2} = \frac{(2.807)^2 (0.75^2 + 1.15^2)}{0.5^2} = 59.40953746$$


which  we  round  up  to  60,  since  it  is  impossible  to  take  a  fractional 
observation.  The  law  firm  must  sample  60  document  deliveries  by  each 
company. 
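
The arithmetic in this example can be checked with a short script. What follows is only a minimal Python sketch and not part of the original text: the function name `equal_sample_size_means` is a hypothetical helper, and it obtains $z_{\alpha/2}$ from `statistics.NormalDist` in the standard library rather than from the rounded table value 2.807, so the unrounded quantity differs slightly in its later decimal places while the rounded-up sample size is the same.

```python
import math
from statistics import NormalDist


def equal_sample_size_means(confidence, margin, sigma1, sigma2):
    """Estimated minimum n1 = n2 for estimating mu1 - mu2 (independent samples).

    Hypothetical helper implementing
    n = z_{alpha/2}^2 (sigma1^2 + sigma2^2) / E^2, rounded up.
    """
    alpha = 1 - confidence
    # z_{alpha/2} cuts off a right tail of area alpha/2 under the standard normal curve
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = z**2 * (sigma1**2 + sigma2**2) / margin**2
    return math.ceil(n)


# Example 13: 99.5% confidence, to within 0.5 hour, sigma1 = 0.75, sigma2 = 1.15
print(equal_sample_size_means(0.995, 0.5, 0.75, 1.15))  # 60
```

Because the result is always rounded up, small differences between a table value and a computed value of $z_{\alpha/2}$ matter only when the unrounded quantity happens to fall very close to a whole number.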


Estimating $\mu_1 - \mu_2$ with Paired Samples

As we mentioned at the end of Section 9.3 "Comparison of Two Population Means:
Paired Samples", if the sample is large (meaning that $n \geq 30$) then in the formula for
the confidence interval we may replace $t_{\alpha/2}$ by $z_{\alpha/2}$, so that the confidence interval
formula becomes $\bar{d} \pm E$ for

$$E = z_{\alpha/2} \frac{s_d}{\sqrt{n}}$$

The number $s_d$ is an estimate of the standard deviation $\sigma_d$ of the population of
differences. We must assume that we either know or can reasonably approximate
$\sigma_d$. Thus, assuming that large samples will be required to meet the criteria given,
we can solve the displayed equation for n to obtain an estimate of the number of
pairs needed in the sample.


Minimum  Sample  Size  for  Estimating  the  Difference  in 
the  Means  of  Two  Populations  Using  Paired  Difference 
Samples 

The estimated minimum number of pairs $n$ needed to estimate the difference
$\mu_d = \mu_1 - \mu_2$ in two population means to within $E$ units at $100(1 - \alpha)\%$
confidence using paired difference samples is

$$n = \frac{(z_{\alpha/2})^2 \sigma_d^2}{E^2} \quad \text{(rounded up)}$$

In all the examples and exercises the population standard deviation of the
differences $\sigma_d$ will be given.
EXAMPLE 14


An automotive tire manufacturer wishes to compare the mean lifetime of two
tread designs under actual driving conditions. They will mount one of each
type of tire on n vehicles (both on the front or both on the back) and
measure the difference in remaining tread after 20,000 miles of driving. If
the standard deviation of the differences is assumed to be 0.025 inch, find
the minimum sample size needed to estimate the difference in mean depth
(at 20,000 miles use) to within 0.01 inch at 99.9% confidence.

Solution:


Confidence level 99.9% means that $\alpha = 1 - 0.999 = 0.001$ so
$\alpha/2 = 0.0005$. From the last line of Figure 12.3 "Critical Values of $t$" we
obtain $z_{0.0005} = 3.291$.


To say that the estimate is to be “to within 0.01 inch” means that $E = 0.01$.
Thus

$$n = \frac{(z_{\alpha/2})^2 \sigma_d^2}{E^2} = \frac{(3.291)^2 (0.025)^2}{(0.01)^2} = 67.69175625$$


which  we  round  up  to  68.  The  manufacturer  must  test  68  pairs  of  tires. 
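
A quick way to see why rounding up is required is to compute the margin of error actually achieved with 68 pairs and with 67 pairs. The following is a minimal Python sketch under stated assumptions: the helper names `paired_sample_size` and `margin_of_error` are hypothetical, and $z_{\alpha/2}$ is computed with `statistics.NormalDist` rather than read from the table.

```python
import math
from statistics import NormalDist


def paired_sample_size(confidence, margin, sigma_d):
    """Estimated minimum number of pairs: n = z^2 sigma_d^2 / E^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma_d / margin) ** 2)


def margin_of_error(confidence, n, sigma_d):
    """Margin of error E = z * sigma_d / sqrt(n) obtained with n pairs."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma_d / math.sqrt(n)


# Example 14: 99.9% confidence, to within 0.01 inch, sigma_d = 0.025 inch
n = paired_sample_size(0.999, 0.01, 0.025)
print(n)                                             # 68
print(margin_of_error(0.999, n, 0.025) <= 0.01)      # True: 68 pairs suffice
print(margin_of_error(0.999, n - 1, 0.025) <= 0.01)  # False: 67 pairs do not
```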


Estimating $p_1 - p_2$

The confidence interval formula for estimating the difference $p_1 - p_2$ between two
population proportions is $\hat{p}_1 - \hat{p}_2 \pm E$, where

$$E = z_{\alpha/2} \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$$

To say that we wish to estimate the difference in proportions to within a certain
number of units means that we want the margin of error $E$ to be no larger than that
number. The number $z_{\alpha/2}$ is determined by the desired level of confidence.

We cannot solve for both $n_1$ and $n_2$, so we have to make an assumption about their
relative sizes. We will specify that they be equal. With these assumptions we obtain
the minimum sample sizes needed by solving the displayed equation for $n_1 = n_2$.


Minimum  Equal  Sample  Sizes  for  Estimating  the 
Difference  in  Two  Population  Proportions 


The estimated minimum equal sample sizes $n_1 = n_2$ needed to estimate the
difference $p_1 - p_2$ in two population proportions to within $E$ percentage points
at $100(1 - \alpha)\%$ confidence is

$$n_1 = n_2 = \frac{(z_{\alpha/2})^2 \left(\hat{p}_1 (1 - \hat{p}_1) + \hat{p}_2 (1 - \hat{p}_2)\right)}{E^2} \quad \text{(rounded up)}$$


Here  we  face  the  same  dilemma  that  we  encountered  in  the  case  of  a  single 
population  proportion:  the  formula  for  estimating  how  large  a  sample  to  take 
contains the numbers $\hat{p}_1$ and $\hat{p}_2$, which we know only after we have taken the
sample. There are two ways out of this dilemma. Typically the researcher will have
some idea as to the values of the population proportions $p_1$ and $p_2$, hence of what
the sample proportions $\hat{p}_1$ and $\hat{p}_2$ are likely to be. If so, those estimates can be used
in the formula.


The second approach to resolving the dilemma is simply to replace each of $\hat{p}_1$ and
$\hat{p}_2$ in the formula by 0.5. As in the one-population case, this is the most
conservative estimate, since it gives the largest possible estimate of n. If we have
an estimate of only one of $p_1$ and $p_2$ we can use that estimate for it, and use the
conservative estimate 0.5 for the other.
EXAMPLE 15


Find the minimum equal sample sizes necessary to construct a 98%
confidence interval for the difference $p_1 - p_2$ with a margin of error $E = 0.05$,

a. assuming that no prior knowledge about $p_1$ or $p_2$ is available; and

b. assuming that prior studies suggest that $p_1 \approx 0.2$ and $p_2 \approx 0.3$.

Solution:

Confidence level 98% means that $\alpha = 1 - 0.98 = 0.02$ so
$\alpha/2 = 0.01$. From the last line of Figure 12.3 "Critical Values of $t$" we
obtain $z_{0.01} = 2.326$.


a. Since there is no prior knowledge of $p_1$ or $p_2$ we make the most
conservative estimate that $\hat{p}_1 = 0.5$ and $\hat{p}_2 = 0.5$. Then

$$n_1 = n_2 = \frac{(z_{\alpha/2})^2 \left(\hat{p}_1 (1 - \hat{p}_1) + \hat{p}_2 (1 - \hat{p}_2)\right)}{E^2} = \frac{(2.326)^2 \left((0.5)(0.5) + (0.5)(0.5)\right)}{0.05^2} = 1082.0552$$

which we round up to 1,083. We must take a sample of size 1,083
from each population.

b. Since $p_1 \approx 0.2$ we estimate $\hat{p}_1$ by 0.2, and since $p_2 \approx 0.3$ we
estimate $\hat{p}_2$ by 0.3. Thus we obtain

$$n_1 = n_2 = \frac{(z_{\alpha/2})^2 \left(\hat{p}_1 (1 - \hat{p}_1) + \hat{p}_2 (1 - \hat{p}_2)\right)}{E^2} = \frac{(2.326)^2 \left((0.2)(0.8) + (0.3)(0.7)\right)}{0.05^2} = 800.720848$$

which we round up to 801. We must take a sample of size 801
from each population.
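
Both parts of this example follow from one formula, with 0.5 used whenever no prior estimate is available. The following is a minimal Python sketch, not taken from the original text: the function name `equal_sample_size_proportions` and its default arguments are hypothetical, and $z_{\alpha/2}$ is computed with `statistics.NormalDist` instead of being read from the table, so only the rounded-up results are expected to agree with the worked solution.

```python
import math
from statistics import NormalDist


def equal_sample_size_proportions(confidence, margin, p1=0.5, p2=0.5):
    """Estimated minimum n1 = n2 for estimating p1 - p2.

    p1 and p2 are prior guesses at the proportions; the default 0.5 is the
    conservative choice because p(1 - p) is largest at p = 0.5.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / margin**2
    return math.ceil(n)


# Example 15: 98% confidence, margin of error 0.05
print(equal_sample_size_proportions(0.98, 0.05))            # part (a): 1083
print(equal_sample_size_proportions(0.98, 0.05, 0.2, 0.3))  # part (b): 801
```

Supplying only one prior guess, for instance `equal_sample_size_proportions(0.98, 0.05, 0.2)`, corresponds to using the known estimate for one proportion and the conservative value 0.5 for the other.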


KEY  TAKEAWAYS 


• If the population standard deviations $\sigma_1$ and $\sigma_2$ are known or can be
estimated, then the minimum equal sizes of independent samples
needed to obtain a confidence interval for the difference $\mu_1 - \mu_2$ in
two population means with a given maximum error of the estimate $E$
and a given level of confidence can be estimated.

• If the standard deviation of the population of differences in pairs
drawn from two populations is known or can be estimated, then the
minimum number of sample pairs needed under paired difference
sampling to obtain a confidence interval for the difference
$\mu_d = \mu_1 - \mu_2$ in two population means with a given maximum error
of the estimate $E$ and a given level of confidence can be estimated.

• The minimum equal sample sizes needed to obtain a confidence interval
for the difference in two population proportions with a given maximum
error of the estimate and a given level of confidence can always be
estimated. If there is prior knowledge of the population proportions $p_1$
and $p_2$ then the estimate can be sharpened.
1. Estimate the common sample size n of equally sized independent samples
needed to estimate $\mu_1 - \mu_2$ as specified when the population standard
deviations are as shown.

a. 90% confidence, to within 3 units, $\sigma_1 = 10$ and $\sigma_2 = 7$

b. 99% confidence, to within 4 units, $\sigma_1 = 6.8$ and $\sigma_2 = 9.3$

c. 95% confidence, to within 5 units, $\sigma_1 = 22.6$ and $\sigma_2 = 31.8$

2. Estimate the common sample size n of equally sized independent samples
needed to estimate $\mu_1 - \mu_2$ as specified when the population standard
deviations are as shown.

a. 80% confidence, to within 2 units, $\sigma_1 = 14$ and $\sigma_2 = 23$

b. 90% confidence, to within 0.3 units, $\sigma_1 = 1.3$ and $\sigma_2 = 0.8$

c. 99% confidence, to within 11 units, $\sigma_1 = 42$ and $\sigma_2 = 37$

3. Estimate the number n of pairs that must be sampled in order to estimate
$\mu_d = \mu_1 - \mu_2$ as specified when the standard deviation $\sigma_d$ of the population
of differences is as shown.

a. 80% confidence, to within 6 units, $\sigma_d = 26.5$

b. 95% confidence, to within 4 units, $\sigma_d = 12$

c. 90% confidence, to within 5.2 units, $\sigma_d = 11.3$

4. Estimate the number n of pairs that must be sampled in order to estimate
$\mu_d = \mu_1 - \mu_2$ as specified when the standard deviation $\sigma_d$ of the population
of differences is as shown.

a. 90% confidence, to within 20 units, $\sigma_d = 75.5$

b. 95% confidence, to within 11 units, $\sigma_d = 31.4$

c. 99% confidence, to within 1.8 units, $\sigma_d = 4$

5. Estimate the minimum equal sample sizes $n_1 = n_2$ necessary in order to
estimate $p_1 - p_2$ as specified.

a. 80% confidence, to within 0.05 (five percentage points)

   a. when no prior knowledge of $p_1$ or $p_2$ is available

   b. when prior studies indicate that $p_1 \approx 0.20$ and $p_2 \approx 0.65$

b. 90% confidence, to within 0.02 (two percentage points)

   a. when no prior knowledge of $p_1$ or $p_2$ is available

   b. when prior studies indicate that $p_1 \approx 0.75$ and $p_2 \approx 0.63$

c. 95% confidence, to within 0.10 (ten percentage points)

   a. when no prior knowledge of $p_1$ or $p_2$ is available

   b. when prior studies indicate that $p_1 \approx 0.11$ and $p_2 \approx 0.37$

6. Estimate the minimum equal sample sizes $n_1 = n_2$ necessary in order to
estimate $p_1 - p_2$ as specified.

a. 80% confidence, to within 0.02 (two percentage points)

   a. when no prior knowledge of $p_1$ or $p_2$ is available

   b. when prior studies indicate that $p_1 \approx 0.78$ and $p_2 \approx 0.65$

b. 90% confidence, to within 0.05 (five percentage points)

   a. when no prior knowledge of $p_1$ or $p_2$ is available

   b. when prior studies indicate that $p_1 \approx 0.12$ and $p_2 \approx 0.24$

c. 95% confidence, to within 0.10 (ten percentage points)

   a. when no prior knowledge of $p_1$ or $p_2$ is available

   b. when prior studies indicate that $p_1 \approx 0.14$ and $p_2 \approx 0.21$


APPLICATIONS 


7.  An  educational  researcher  wishes  to  estimate  the  difference  in  average  scores 
of  elementary  school  children  on  two  versions  of  a  100-point  standardized  test, 
at  99%  confidence  and  to  within  two  points.  Estimate  the  minimum  equal 
sample  sizes  necessary  if  it  is  known  that  the  standard  deviation  of  scores  on 
different  versions  of  such  tests  is  4.9. 

8.  A  university  administrator  wishes  to  estimate  the  difference  in  mean  grade 
point  averages  among  all  men  affiliated  with  fraternities  and  all  unaffiliated 
men,  with  95%  confidence  and  to  within  0.15.  It  is  known  from  prior  studies 
that  the  standard  deviations  of  grade  point  averages  in  the  two  groups  have 
common  value  0.4.  Estimate  the  minimum  equal  sample  sizes  necessary  to 
meet  these  criteria. 

9. An automotive tire manufacturer wishes to estimate the difference in mean
wear of tires manufactured with an experimental material and ordinary
production tires, with 90% confidence and to within 0.5 mm. To eliminate
extraneous factors arising from different driving conditions the tires will be
tested in pairs on the same vehicles. It is known from prior studies that the
standard deviation of the differences of wear of tires constructed with the two
kinds of materials is 1.75 mm. Estimate the minimum number of pairs in the
sample necessary to meet these criteria.

10. To assess the relative happiness of men and women in their marriages, a
marriage counselor plans to administer a test measuring happiness in
marriage to n randomly selected married couples, record their test scores,
find the differences, and then draw inferences on the possible difference. Let
$\mu_1$ and $\mu_2$ be the true average levels of happiness in marriage for men and
women respectively as measured by this test. Suppose it is desired to find a
90% confidence interval for estimating $\mu_d = \mu_1 - \mu_2$ to within two test
points. Suppose further that, from prior studies, it is known that the standard
deviation of the differences in test scores is $\sigma_d \approx 10$. What is the minimum
number of married couples that must be included in this study?

11. A journalist plans to interview an equal number of members of two political
parties to compare the proportions in each party who favor a proposal to allow
citizens with a proper license to carry a concealed handgun in public parks. Let
$p_1$ and $p_2$ be the true proportions of members of the two parties who are in
favor of the proposal. Suppose it is desired to find a 95% confidence interval
for estimating $p_1 - p_2$ to within 0.05. Estimate the minimum equal number
of members of each party that must be sampled to meet these criteria.

12. A member of the state board of education wants to compare the proportions of
National Board Certified (NBC) teachers in private high schools and in public
high schools in the state. His study plan calls for an equal number of private
school teachers and public school teachers to be included in the study. Let $p_1$
and $p_2$ be these proportions. Suppose it is desired to find a 99% confidence
interval that estimates $p_1 - p_2$ to within 0.05.

a. Supposing that both proportions are known, from a prior study, to be
approximately 0.15, compute the minimum common sample size needed.

b. Compute the minimum common sample size needed on the supposition
that nothing is known about the values of $p_1$ and $p_2$.
ANSWERS


1. a. $n_1 = n_2 = 45$,

b. $n_1 = n_2 = 56$,

c. $n_1 = n_2 = 234$

3. a. $n_1 = n_2 = 33$,

b. $n_1 = n_2 = 35$,

c. $n_1 = n_2 = 13$

5. a. a. $n_1 = n_2 = 329$,

      b. $n_1 = n_2 = 255$

b. a. $n_1 = n_2 = 3383$,

   b. $n_1 = n_2 = 2846$

c. a. $n_1 = n_2 = 193$,

   b. $n_1 = n_2 = 128$

7. $n_1 = n_2 \approx 80$

9. $n_1 = n_2 = 34$

11. $n_1 = n_2 \approx 769$
Correlation and Regression


Our interest in this chapter is in situations in which we can associate to each
element of a population or sample two measurements x and y, particularly in the
case that it is of interest to use the value of x to predict the value of y. For example,
the population could be the air in automobile garages, x could be the electrical
current produced by an electrochemical reaction taking place in a carbon monoxide
meter, and y the concentration of carbon monoxide in the air. In this chapter we
will learn statistical methods for analyzing the relationship between variables x and
y in this context.

A list of all the formulas that appear anywhere in this chapter is collected in the
last section for ease of reference.
10.1 Linear Relationships Between Variables


LEARNING  OBJECTIVE 

1.  To  learn  what  it  means  for  two  variables  to  exhibit  a  relationship  that  is 
close  to  linear  but  which  contains  an  element  of  randomness. 


The  following  table  gives  examples  of  the  kinds  of  pairs  of  variables  which  could  be 
of  interest  from  a  statistical  point  of  view. 


x                                                      y
Predictor or independent variable                      Response or dependent variable
-----------------------------------------------------  ----------------------------------
Temperature in degrees Celsius                         Temperature in degrees Fahrenheit
Area of a house (sq. ft.)                              Value of the house
Age of a particular make and model car                 Resale value of the car
Amount spent by a business on advertising in a year    Revenue received that year
Height of a 25-year-old man                            Weight of the man

The first line in the table is different from all the rest because in that case and no
other the relationship between the variables is deterministic: once the value of x is
known the value of y is completely determined. In fact there is a formula for y in
terms of x: $y = \frac{9}{5} x + 32$. Choosing several values for x and computing the
corresponding value for y for each one using the formula gives the table


x    −40    −15     0    20     50
y    −40      5    32    68    122
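
The table of values can be reproduced directly from the conversion formula. The following is a minimal Python sketch; the function name `fahrenheit` is a hypothetical choice made for illustration.

```python
def fahrenheit(celsius):
    """Deterministic linear relationship y = (9/5) x + 32."""
    # Multiply before dividing so whole-number inputs give exact results
    return 9 * celsius / 5 + 32


x_values = [-40, -15, 0, 20, 50]
for x in x_values:
    print(x, fahrenheit(x))
```

Running the function twice on the same x always returns the same y, which is exactly what it means for the relationship to be deterministic.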

We  can  plot  these  data  by  choosing  a  pair  of  perpendicular  lines  in  the  plane,  called 
the  coordinate  axes,  as  shown  in  Figure  10.1  "Plot  of  Celsius  and  Fahrenheit 
Temperature  Pairs".  Then  to  each  pair  of  numbers  in  the  table  we  associate  a 
unique  point  in  the  plane,  the  point  that  lies  x  units  to  the  right  of  the  vertical  axis 
(to the left if $x < 0$) and y units above the horizontal axis (below if $y < 0$). The
relationship between x and y is called a linear relationship because the points so
plotted all lie on a single straight line. The number $\frac{9}{5}$ in the equation $y = \frac{9}{5} x + 32$
is the slope of the line, and measures its steepness. It describes how y changes in
response to a change in x: if x increases by 1 unit then y increases (since $\frac{9}{5}$ is
positive) by $\frac{9}{5}$ unit. If the slope had been negative then y would have decreased in
response to an increase in x. The number 32 in the formula $y = \frac{9}{5} x + 32$ is the
y-intercept of the line; it identifies where the line crosses the y-axis. You may recall
from an earlier course that every non-vertical line in the plane is described by an
equation of the form $y = mx + b$, where m is the slope of the line and b is its
y-intercept.


Figure 10.1 Plot of Celsius and Fahrenheit Temperature Pairs


The relationship between x and y in the temperature example is deterministic
because once the value of x is known, the value of y is completely determined. In
contrast, all the other relationships listed in the table above have an element of
randomness in them. Consider the relationship described in the last line of the
table, the height x of a man aged 25 and his weight y. If we were to randomly select
several 25-year-old men and measure the height and weight of each one, we might
obtain a collection of (x, y) pairs something like this:




(68,151)  (69,146)  (70,157)  (70,164)  (71,171)  (72,160) 

(72,163)  (72,180)  (73,170)  (73,175)  (74,178)  (75,188) 

A  plot  of  these  data  is  shown  in  Figure  10.2  "Plot  of  Height  and  Weight  Pairs".  Such 
a  plot  is  called  a  scatter  diagram  or  scatter  plot.  Looking  at  the  plot  it  is  evident 
that  there  exists  a  linear  relationship  between  height  x  and  weight  y,  but  not  a 
perfect  one.  The  points  appear  to  be  following  a  line,  but  not  exactly.  There  is  an 
element  of  randomness  present. 


Figure  10.2  Plot  of  Height  and  Weight  Pairs 


In this chapter we will analyze situations in which variables x and y exhibit such a
linear relationship with randomness. The level of randomness will vary from
situation to situation. In the introductory example connecting an electric current
and the level of carbon monoxide in air, the relationship is almost perfect. In other
situations, such as the heights and weights of individuals, the connection between
the two variables involves a high degree of randomness. In the next section we will
see how to quantify the strength of the linear relationship between two variables.
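
The distinction between the two kinds of relationship can be checked mechanically for the data sets of this section. The following is a minimal Python sketch, offered as an illustration rather than as a method from the text: the function name `lie_on_one_line` is hypothetical, and it assumes at least two points, that the first two points are distinct, and that the coordinates are whole numbers so that exact comparison is safe.

```python
def lie_on_one_line(points):
    """Return True if all (x, y) points lie exactly on one straight line.

    Assumes at least two points and that the first two are distinct.
    """
    (x0, y0), (x1, y1) = points[0], points[1]
    # (x, y) is on the line through the first two points exactly when the
    # rise-over-run comparison below holds (written without division).
    return all((x1 - x0) * (y - y0) == (y1 - y0) * (x - x0) for x, y in points)


celsius_fahrenheit = [(-40, -40), (-15, 5), (0, 32), (20, 68), (50, 122)]
height_weight = [(68, 151), (69, 146), (70, 157), (70, 164), (71, 171), (72, 160),
                 (72, 163), (72, 180), (73, 170), (73, 175), (74, 178), (75, 188)]

print(lie_on_one_line(celsius_fahrenheit))  # True: deterministic linear relationship
print(lie_on_one_line(height_weight))       # False: a linear trend with randomness
```

A result of False says only that the points are not perfectly collinear; it does not measure how close to a line they are, which is the subject of the next section.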
KEY TAKEAWAYS


• Two variables x and y have a deterministic linear relationship if points
plotted from (x, y) pairs lie exactly along a single straight line.

•  In  practice  it  is  common  for  two  variables  to  exhibit  a  relationship  that 
is  close  to  linear  but  which  contains  an  element,  possibly  large,  of 
randomness. 
a. Pick five distinct x-values, use the equation to compute the corresponding
y-values,  and  plot  the  five  points  obtained. 

b.  Give  the  value  of  the  slope  of  the  line;  give  the  value  of  the  y-intercept. 

2. A line has equation $y = x - 0.5$.

a.  Pick  five  distinct  x-values,  use  the  equation  to  compute  the  corresponding 
y-values,  and  plot  the  five  points  obtained. 

b.  Give  the  value  of  the  slope  of  the  line;  give  the  value  of  the  y-intercept. 

3. A line has equation $y = -2x + 4$.

a.  Pick  five  distinct  x-values,  use  the  equation  to  compute  the  corresponding 
y-values,  and  plot  the  five  points  obtained. 

b.  Give  the  value  of  the  slope  of  the  line;  give  the  value  of  the  y-intercept. 

4. A line has equation $y = -1.5x + 1$.

a.  Pick  five  distinct  x-values,  use  the  equation  to  compute  the  corresponding 
y-values,  and  plot  the  five  points  obtained. 

b.  Give  the  value  of  the  slope  of  the  line;  give  the  value  of  the  y-intercept. 

5.  Based  on  the  information  given  about  a  line,  determine  how  y  will  change 
(increase,  decrease,  or  stay  the  same)  when  x  is  increased,  and  explain.  In  some 
cases  it  might  be  impossible  to  tell  from  the  information  given. 

a.  The  slope  is  positive. 

b.  The  y-intercept  is  positive. 

c.  The  slope  is  zero. 

6.  Based  on  the  information  given  about  a  line,  determine  how  y  will  change 
(increase,  decrease,  or  stay  the  same)  when  x  is  increased,  and  explain.  In  some 
cases  it  might  be  impossible  to  tell  from  the  information  given. 

a.  The  y-intercept  is  negative. 

b.  The  y-intercept  is  zero. 

c.  The  slope  is  negative. 

7. A data set consists of eight (x, y) pairs of numbers:

(0,12) (4,16) (8,22) (15,28)

(2,15)  (5,14)  (13,24)  (20,30) 

a.  Plot  the  data  in  a  scatter  diagram. 

b.  Based  on  the  plot,  explain  whether  the  relationship  between  x  and y 
appears  to  be  deterministic  or  to  involve  randomness. 

c. Based on the plot, explain whether the relationship between x and y
appears  to  be  linear  or  not  linear. 

8. A data set consists of ten (x, y) pairs of numbers:

(3,20)  (6,9)  (11,0)  (14,1)  (18,9) 

(5,13)  (8,4)  (12,0)  (17,6)  (20,16) 

a.  Plot  the  data  in  a  scatter  diagram. 

b. Based on the plot, explain whether the relationship between x and y
appears  to  be  deterministic  or  to  involve  randomness. 

c.  Based  on  the  plot,  explain  whether  the  relationship  between  x  and y 
appears  to  be  linear  or  not  linear. 

9. A data set consists of nine (x, y) pairs of numbers:

(8,16)  (10,4)  (12,0)  (14,4)  (16,16) 

(9,9)  (11,1)  (13,1)  (15,9) 

a.  Plot  the  data  in  a  scatter  diagram. 

b.  Based  on  the  plot,  explain  whether  the  relationship  between  x  and y 
appears  to  be  deterministic  or  to  involve  randomness. 

c. Based on the plot, explain whether the relationship between x and y
appears  to  be  linear  or  not  linear. 

10. A data set consists of five (x, y) pairs of numbers:

(0,1)  (2,5)  (3,7)  (5,11)  (8,17) 

a.  Plot  the  data  in  a  scatter  diagram. 

b.  Based  on  the  plot,  explain  whether  the  relationship  between  x  and y 
appears  to  be  deterministic  or  to  involve  randomness. 

c. Based on the plot, explain whether the relationship between x and y
appears  to  be  linear  or  not  linear. 
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APPLICATIONS 


11.  At  60°F  a  particular  blend  of  automotive  gasoline  weights  6.17  lb/gal.  The 
weighty  of  gasoline  on  a  tank  truck  that  is  loaded  with  x  gallons  of  gasoline  is 
given  by  the  linear  equation 

y  =  6.11 x 

a.  Explain  whether  the  relationship  between  the  weight  y  and  the  amount  x 
of  gasoline  is  deterministic  or  contains  an  element  of  randomness. 

b.  Predict  the  weight  of  gasoline  on  a  tank  truck  that  has  just  been  loaded 
with  6,750  gallons  of  gasoline. 

12. The rate for renting a motor scooter for one day at a beach resort area is $25
plus 30 cents for each mile the scooter is driven. The total cost y in dollars for
renting a scooter and driving it x miles is

y = 0.30x + 25

a.  Explain  whether  the  relationship  between  the  cost  y  of  renting  the  scooter 
for  a  day  and  the  distance  x  that  the  scooter  is  driven  that  day  is 
deterministic  or  contains  an  element  of  randomness. 

b.  A  person  intends  to  rent  a  scooter  one  day  for  a  trip  to  an  attraction  17 
miles  away.  Assuming  that  the  total  distance  the  scooter  is  driven  is  34 
miles,  predict  the  cost  of  the  rental. 

13.  The  pricing  schedule  for  labor  on  a  service  call  by  an  elevator  repair  company 
is  $150  plus  $50  per  hour  on  site. 

a.  Write  down  the  linear  equation  that  relates  the  labor  cost  y  to  the  number 
of  hours  x  that  the  repairman  is  on  site. 

b.  Calculate  the  labor  cost  for  a  service  call  that  lasts  2.5  hours. 

14.  The  cost  of  a  telephone  call  made  through  a  leased  line  service  is  2.5  cents  per 
minute. 

a.  Write  down  the  linear  equation  that  relates  the  cost  y  (in  cents)  of  a  call  to 
its  length  x. 

b.  Calculate  the  cost  of  a  call  that  lasts  23  minutes. 


LARGE  DATA  SET  EXERCISES 


15. Large Data Set 1 lists the SAT scores and GPAs of 1,000 students. Plot the
scatter diagram with SAT score as the independent variable (x) and GPA as the
dependent variable (y). Comment on the appearance and strength of any linear
trend.

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls

16.  Large  Data  Set  12  lists  the  golf  scores  on  one  round  of  golf  for  75  golfers  first 
using  their  own  original  clubs,  then  using  clubs  of  a  new,  experimental  design 
(after  two  months  of  familiarization  with  the  new  clubs).  Plot  the  scatter 
diagram  with  golf  score  using  the  original  clubs  as  the  independent  variable  (x) 
and  golf  score  using  the  new  clubs  as  the  dependent  variable  (y).  Comment  on 
the  appearance  and  strength  of  any  linear  trend. 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal2.xls 

17.  Large  Data  Set  13  records  the  number  of  bidders  and  sales  price  of  a  particular 
type  of  antique  grandfather  clock  at  60  auctions.  Plot  the  scatter  diagram  with 
the  number  of  bidders  at  the  auction  as  the  independent  variable  (x)  and  the 
sales  price  as  the  dependent  variable  (y).  Comment  on  the  appearance  and 
strength  of  any  linear  trend. 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal3.xls 
ANSWERS


1. a. Answers vary.

   b. Slope m = 0.5; y-intercept b = 2.

3. a. Answers vary.

   b. Slope m = −2; y-intercept b = 4.

5. a. y increases.

   b. Impossible to tell.

   c. y does not change.

7. a. Scatter diagram needed.

   b. Involves randomness.

   c. Linear.

9. a. Scatter diagram needed.

   b. Deterministic.

   c. Not linear.

11. a. Deterministic.

    b. 41,647.5 pounds.

13. a. y = 50x + 150.

    b. $275.

15. There appears to be a hint of some positive correlation.

17. There appears to be clear positive correlation.
10.2 The Linear Correlation Coefficient


LEARNING  OBJECTIVE 

1. To learn what the linear correlation coefficient is, how to compute it,
and what it tells us about the relationship between two variables x and y.


Figure 10.3 "Linear Relationships of Varying Strengths" illustrates linear
relationships between two variables x and y of varying strengths. It is visually
apparent that in the situation in panel (a), x could serve as a useful predictor of y;
it would be less useful in the situation illustrated in panel (b); and in the situation
of panel (c) the linear relationship is so weak as to be practically nonexistent. The
linear correlation coefficient is a number computed directly from the data that
measures the strength of the linear relationship between the two variables x and y.


Figure  10.3  Linear  Relationships  of  Varying  Strengths 
Definition


The linear correlation coefficient¹ for a collection of n pairs (x, y) of numbers in
a sample is the number r given by the formula

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$$

where

$$SS_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2, \quad SS_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right), \quad SS_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2$$


The linear correlation coefficient has the following properties, illustrated in Figure
10.4 "Linear Correlation Coefficient R":


1.  The  value  of  r  lies  between  -1  and  1,  inclusive. 

2.  The  sign  of  r  indicates  the  direction  of  the  linear  relationship  between 
x  and  y: 

1. if r < 0 then y tends to decrease as x is increased.

2. if r > 0 then y tends to increase as x is increased.

3.  The  size  of  |r|  indicates  the  strength  of  the  linear  relationship  between 
x  and  y: 

1. if |r| is near 1 (that is, if r is near either 1 or -1) then the linear
relationship between x and y is strong.

2.  if  |r|  is  near  0  (that  is,  if  r  is  near  0  and  of  either  sign)  then  the 
linear  relationship  between  x  and  y  is  weak. 
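
The properties above can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `linear_correlation` and the two small data sets are illustrative assumptions. It computes r from the sums-of-squares formula in the definition and confirms that points lying exactly on a rising line give r = 1 while points lying exactly on a falling line give r = −1.

```python
import math


def linear_correlation(xs, ys):
    """Return r = SS_xy / sqrt(SS_xx * SS_yy) for paired sample data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    # Undefined (division by zero) if all x-values or all y-values are equal.
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hypothetical data: five points exactly on a line, rising and then falling.
xs = [1, 2, 3, 4, 5]
r_up = linear_correlation(xs, [2 * x + 1 for x in xs])
r_down = linear_correlation(xs, [10 - 3 * x for x in xs])
print(r_up, r_down)
```

Real data rarely fall exactly on a line, so in practice |r| is strictly less than 1; the sign still tells the direction of the trend.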


1.  A  number  computed  directly 
from  the  data  that  measures 
the  strength  of  the  linear 
relationship  between  the  two 
variables  x  and  y. 
Figure 10.4 Linear Correlation Coefficient R

[Six scatter plots: (a) r = −1; (b) r = −0.94; (c) r = +0.08; (d) r = +1; (e) r = +0.86; (f) r = 0.00]

Pay particular attention to panel (f) in Figure 10.4 "Linear Correlation Coefficient
R". It shows a perfectly deterministic relationship between x and y, but r = 0 because
the relationship is not linear. (In this particular case the points lie on the top half of
a circle.)
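
The situation in panel (f) can be reproduced with a short computation. This is a sketch under stated assumptions: the 21 evenly spaced x-values on the unit circle are chosen here for illustration and are not the points used in the figure, and `linear_correlation` is a hypothetical helper implementing the definition of r.

```python
import math


def linear_correlation(xs, ys):
    """Return r = SS_xy / sqrt(SS_xx * SS_yy) for paired sample data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Points on the top half of the unit circle: y is completely determined by x.
xs = [i / 10 for i in range(-10, 11)]
ys = [math.sqrt(1 - x * x) for x in xs]

r = linear_correlation(xs, ys)
print(round(r, 6))  # zero up to floating-point rounding
```

Because the points are symmetric about the vertical axis, the rising left half and the falling right half cancel in SS_xy, so r measures no linear trend even though y is an exact function of x.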
EXAMPLE 1


Compute  the  linear  correlation  coefficient  for  the  height  and  weight  pairs 
plotted  in  Figure  10.2  "Plot  of  Height  and  Weight  Pairs". 

Solution: 

Even for small data sets like this one, computations are too long to do
completely by hand. In actual practice the data are entered into a calculator
or computer and a statistics program is used. In order to clarify the meaning
of the formulas we will display the data and related quantities in tabular
form. For each (x, y) pair we compute three numbers: x², xy, and y², as
shown in the table provided. In the last line of the table we have the sum of
the numbers in each column. Using them we compute:


 x      y       x²       xy       y²
 68    151     4624    10268    22801
 69    146     4761    10074    21316
 70    157     4900    10990    24649
 70    164     4900    11480    26896
 71    171     5041    12141    29241
 72    160     5184    11520    25600
 72    163     5184    11736    26569
 72    180     5184    12960    32400
 73    170     5329    12410    28900
 73    175     5329    12775    30625
 74    178     5476    13172    31684
 75    188     5625    14100    35344
 Σ 859  2003    61537   143626   336025
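$$SS_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2 = 61537 - \frac{1}{12}(859)^2 = 46.91\overline{6}$$

$$SS_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) = 143626 - \frac{1}{12}(859)(2003) = 244.58\overline{3}$$

$$SS_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2 = 336025 - \frac{1}{12}(2003)^2 = 1690.91\overline{6}$$


so that


$$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{244.58\overline{3}}{\sqrt{(46.91\overline{6})(1690.91\overline{6})}} = 0.868$$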


The number r = 0.868 quantifies what is visually apparent from Figure
10.2 "Plot of Height and Weight Pairs": weight tends to increase linearly
with height (r is positive) and although the relationship is not perfect, it is
reasonably strong (r is near 1).
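
As the solution notes, in practice this computation is handed to software. The sketch below is one way to do it in plain Python, assuming nothing beyond the twelve height and weight pairs from the table; the variable names are illustrative. It reproduces the column sums and the three sums of squares before computing r.

```python
import math

# Height (x) and weight (y) pairs from the table in Example 1.
heights = [68, 69, 70, 70, 71, 72, 72, 72, 73, 73, 74, 75]
weights = [151, 146, 157, 164, 171, 160, 163, 180, 170, 175, 178, 188]

n = len(heights)
sum_x = sum(heights)
sum_y = sum(weights)
sum_x2 = sum(x * x for x in heights)
sum_xy = sum(x * y for x, y in zip(heights, weights))
sum_y2 = sum(y * y for y in weights)

# Sums of squares as defined for the linear correlation coefficient.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_y2 - sum_y ** 2 / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(sum_x, sum_y, sum_x2, sum_xy, sum_y2)
print(round(r, 3))  # 0.868
```

Printing the intermediate sums makes it easy to compare each one against the last line of the table when checking hand work.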


KEY  TAKEAWAYS 


• The linear correlation coefficient measures the strength and direction of
the linear relationship between two variables x and y.

• The sign of the linear correlation coefficient indicates the direction of
the linear relationship between x and y.

•  When  r  is  near  1  or  -1  the  linear  relationship  is  strong;  when  it  is  near  0 
the  linear  relationship  is  weak. 
With the exception of the exercises at the end of Section 10.3 "Modelling
Linear  Relationships  with  Randomness  Present",  the  first  Basic  exercise  in 
each  of  the  following  sections  through  Section  10.7  "Estimation  and 
Prediction"  uses  the  data  from  the  first  exercise  here,  the  second  Basic 
exercise  uses  the  data  from  the  second  exercise  here,  and  so  on,  and 
similarly  for  the  Application  exercises.  Save  your  computations  done  on 
these  exercises  so  that  you  do  not  need  to  repeat  them  later. 

1.  For  the  sample  data 


x    0    1    3    5    8
y    2    4    6    5    9

a.  Draw  the  scatter  plot. 

b.  Based  on  the  scatter  plot,  predict  the  sign  of  the  linear  correlation 
coefficient.  Explain  your  answer. 

c.  Compute  the  linear  correlation  coefficient  and  compare  its  sign  to  your 
answer  to  part  (b). 

2.  For  the  sample  data 


x    0    2    3    6    9
y    0    3    3    4    8

a.  Draw  the  scatter  plot. 

b.  Based  on  the  scatter  plot,  predict  the  sign  of  the  linear  correlation 
coefficient.  Explain  your  answer. 

c.  Compute  the  linear  correlation  coefficient  and  compare  its  sign  to  your 
answer  to  part  (b). 

3.  For  the  sample  data 


x    1    3    4    6    8
y    4    1    3   -1    0

a.  Draw  the  scatter  plot. 

b.  Based  on  the  scatter  plot,  predict  the  sign  of  the  linear  correlation 
coefficient.  Explain  your  answer. 

c.  Compute  the  linear  correlation  coefficient  and  compare  its  sign  to  your 
answer  to  part  (b). 
4. For the sample data


x    1    2    4    7    9
y    5    5    6   -3    0

a.  Draw  the  scatter  plot. 

b.  Based  on  the  scatter  plot,  predict  the  sign  of  the  linear  correlation 
coefficient.  Explain  your  answer. 

c.  Compute  the  linear  correlation  coefficient  and  compare  its  sign  to  your 
answer  to  part  (b). 

5.  For  the  sample  data 


x    1    1    3    4    5
y    2    1    5    3    4

a.  Draw  the  scatter  plot. 

b.  Based  on  the  scatter  plot,  predict  the  sign  of  the  linear  correlation 
coefficient.  Explain  your  answer. 

c.  Compute  the  linear  correlation  coefficient  and  compare  its  sign  to  your 
answer  to  part  (b). 

6.  For  the  sample  data 


x    1    3    5    5    8
y    5   -2    2   -1   -3

a.  Draw  the  scatter  plot. 

b.  Based  on  the  scatter  plot,  predict  the  sign  of  the  linear  correlation 
coefficient.  Explain  your  answer. 

c.  Compute  the  linear  correlation  coefficient  and  compare  its  sign  to  your 
answer  to  part  (b). 

7.  Compute  the  linear  correlation  coefficient  for  the  sample  data  summarized  by 
the  following  information: 


$n = 5$, $\sum x = 25$, $\sum x^2 = 165$

$\sum y = 24$, $\sum y^2 = 134$, $\sum xy = 144$

$1 < x < 9$

8.  Compute  the  linear  correlation  coefficient  for  the  sample  data  summarized  by 
the  following  information: 
$n = 5$, $\sum x = 31$, $\sum x^2 = 253$

$\sum y = 18$, $\sum y^2 = 90$, $\sum xy = 148$

$2 < x < 12$

9.  Compute  the  linear  correlation  coefficient  for  the  sample  data  summarized  by 
the  following  information: 

$n = 10$, $\sum x = 0$, $\sum x^2 = 60$

$\sum y = 24$, $\sum y^2 = 234$, $\sum xy = -87$

$-4 < x < 4$

10.  Compute  the  linear  correlation  coefficient  for  the  sample  data  summarized  by 
the  following  information: 

$n = 10$, $\sum x = -3$, $\sum x^2 = 263$

$\sum y = 55$, $\sum y^2 = 917$, $\sum xy = -355$

$-10 < x < 10$


APPLICATIONS 


11.  The  age  x  in  months  and  vocabulary  y  were  measured  for  six  children,  with  the 
results  shown  in  the  table. 


x    13    14    15    16    16    18
y     8    10    15    20    27    30

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 

12.  The  curb  weight  x  in  hundreds  of  pounds  and  braking  distance  y  in  feet,  at  50 
miles  per  hour  on  dry  pavement,  were  measured  for  five  vehicles,  with  the 
results  shown  in  the  table. 


x     25    27.5   32.5    35     45
y    105    125    140    140    150

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 


13.  The  age  x  and  resting  heart  rate  y  were  measured  for  ten  men,  with  the  results 
shown  in  the  table. 
x    20    23    30    37    35
y    72    71    73    74    74

x    45    51    55    60    63
y    73    72    79    75    77

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 


14.  The  wind  speed  x  in  miles  per  hour  and  wave  height  y  in  feet  were  measured 
under  various  conditions  on  an  enclosed  deep  water  sea,  with  the  results 
shown in the table.


x    0      0      2      7      7
y    2.0    0.0    0.3    0.7    3.3

x    9      13     20     22     31
y    4.9    4.9    3.0    6.9    5.9

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 

15.  The  advertising  expenditure  x  and  sales  y  in  thousands  of  dollars  for  a  small 
retail  business  in  its  first  eight  years  in  operation  are  shown  in  the  table. 


x    1.4    1.6    1.6    2.0
y    180    184    190    220

x    2.0    2.2    2.4    2.6
y    186    215    205    240

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 

16. The height x at age 2 and y at age 20, both in inches, for ten women are
tabulated in the table.


x    31.3    31.7    32.5    33.5    34.4
y    60.7    61.0    63.1    64.2    65.9

x    35.2    35.8    32.7    33.6    34.8
y    68.2    67.6    62.3    64.9    66.8

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 
17. The course average x just before a final exam and the score y on the final exam
were  recorded  for  15  randomly  selected  students  in  a  large  physics  class,  with 
the  results  shown  in  the  table. 


x    69.3    87.7    50.5    51.9    82.7
y    56      89      55      49      61

x    70.5    72.4    91.7    83.3    86.5
y    66      72      83      73      82

x    79.3    78.5    75.7    52.3    62.2
y    92      80      64      18      76

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 

18.  The  table  shows  the  acres  x  of  corn  planted  and  acres  y  of  corn  harvested,  in 
millions  of  acres,  in  a  particular  country  in  ten  successive  years. 


x    75.7    78.9    78.6    80.9    81.8
y    68.8    69.3    70.9    73.6    75.1

x    78.3    93.5    85.9    86.4    88.2
y    70.6    86.5    78.6    79.5    81.4

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 

19.  Fifty  male  subjects  drank  a  measured  amount  x  (in  ounces)  of  a  medication  and 
the  concentration  y  (in  percent)  in  their  blood  of  the  active  ingredient  was 
measured  30  minutes  later.  The  sample  data  are  summarized  by  the  following 
information. 

$n = 50$, $\sum x = 112.5$, $\sum y = 4.83$

$\sum xy = 15.255$, $0 < x < 4.5$

$\sum x^2 = 356.25$, $\sum y^2 = 0.667$

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 

20.  In  an  effort  to  produce  a  formula  for  estimating  the  age  of  large  free-standing 
oak  trees  non-invasively,  the  girth  x  (in  inches)  five  feet  off  the  ground  of  15 
such trees of known age y (in years) was measured. The sample data are
summarized  by  the  following  information. 




$n = 15$, $\sum x = 3368$, $\sum y = 6496$

$\sum xy = 1{,}933{,}219$, $\sum x^2 = 917{,}780$

$\sum y^2 = 4{,}260{,}666$, $74 < x < 395$

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 

21.  Construction  standards  specify  the  strength  of  concrete  28  days  after  it  is 
poured.  For  30  samples  of  various  types  of  concrete  the  strength  x  after  3  days 
and  the  strength  y  after  28  days  (both  in  hundreds  of  pounds  per  square  inch) 
were  measured.  The  sample  data  are  summarized  by  the  following 
information. 

$n = 30$, $\sum x = 501.6$, $\sum y = 1338.8$

$\sum xy = 23{,}246.55$, $\sum x^2 = 8724.74$

$\sum y^2 = 61{,}980.14$, $11 < x < 22$

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 

22.  Power-generating  facilities  used  forecasts  of  temperature  to  forecast  energy 
demand.  The  average  temperature  x  (degrees  Fahrenheit)  and  the  day’s  energy 
demand  y  (million  watt-hours)  were  recorded  on  40  randomly  selected  winter 
days  in  the  region  served  by  a  power  company.  The  sample  data  are 
summarized  by  the  following  information. 

$n = 40$, $\sum x = 2000$, $\sum y = 2969$

$\sum xy = 143{,}042$, $\sum x^2 = 101{,}340$

$\sum y^2 = 243{,}027$, $40 < x < 60$

Compute  the  linear  correlation  coefficient  for  these  sample  data  and  interpret 
its  meaning  in  the  context  of  the  problem. 


ADDITIONAL  EXERCISES 


23.  In  each  case  state  whether  you  expect  the  two  variables  x  and y  indicated  to 
have  positive,  negative,  or  zero  correlation. 

a.  the  number  x  of  pages  in  a  book  and  the  age  y  of  the  author 

b.  the  number  x  of  pages  in  a  book  and  the  age  y  of  the  intended  reader 

c.  the  weight  x  of  an  automobile  and  the  fuel  economy  y  in  miles  per  gallon 

d.  the  weight  x  of  an  automobile  and  the  reading  y  on  its  odometer 




e.  the  amount  x  of  a  sedative  a  person  took  an  hour  ago  and  the  time  y  it 
takes  him  to  respond  to  a  stimulus 

24.  In  each  case  state  whether  you  expect  the  two  variables  x  and y  indicated  to 
have  positive,  negative,  or  zero  correlation. 

a.  the  length  x  of  time  an  emergency  flare  will  burn  and  the  length  y  of  time 
the  match  used  to  light  it  burned 

b.  the  average  length  x  of  time  that  calls  to  a  retail  call  center  are  on  hold  one 
day  and  the  number  y  of  calls  received  that  day 

c.  the  length  x  of  a  regularly  scheduled  commercial  flight  between  two  cities 
and  the  headwind  y  encountered  by  the  aircraft 

d. the value x of a house and its size y in square feet

e.  the  average  temperature  x  on  a  winter  day  and  the  energy  consumption  y 
of  the  furnace 

25. Changing the units of measurement on two variables x and y should not change
the linear correlation coefficient. Moreover, most changes of units amount to
simply multiplying the measurements by a constant (for example, 1 foot = 12 inches).
Multiply  each  x  value  in  the  table  in  Exercise  1  by  two  and  compute  the  linear 
correlation  coefficient  for  the  new  data  set.  Compare  the  new  value  of  r  to  the 
one  for  the  original  data. 

26.  Refer  to  the  previous  exercise.  Multiply  each  x  value  in  the  table  in  Exercise  2 
by  two,  multiply  each  y  value  by  three,  and  compute  the  linear  correlation 
coefficient  for  the  new  data  set.  Compare  the  new  value  of  r  to  the  one  for  the 
original  data. 

27.  Reversing  the  roles  of  x  and  y  in  the  data  set  of  Exercise  1  produces  the  data  set 

x    2    4    6    5    9
y    0    1    3    5    8

Compute  the  linear  correlation  coefficient  of  the  new  set  of  data  and  compare 
it  to  what  you  got  in  Exercise  1. 

28.  In  the  context  of  the  previous  problem,  look  at  the  formula  for  r  and  see  if  you 
can  tell  why  what  you  observed  there  must  be  true  for  every  data  set. 


LARGE  DATA  SET  EXERCISES 


29.  Large  Data  Set  1  lists  the  SAT  scores  and  GPAs  of  1,000  students.  Compute  the 
linear  correlation  coefficient  r.  Compare  its  value  to  your  comments  on  the 
appearance  and  strength  of  any  linear  trend  in  the  scatter  diagram  that  you 
constructed in the first large data set problem for Section 10.1 "Linear
Relationships  Between  Variables". 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

30.  Large  Data  Set  12  lists  the  golf  scores  on  one  round  of  golf  for  75  golfers  first 
using  their  own  original  clubs,  then  using  clubs  of  a  new,  experimental  design 
(after  two  months  of  familiarization  with  the  new  clubs).  Compute  the  linear 
correlation  coefficient  r.  Compare  its  value  to  your  comments  on  the 
appearance  and  strength  of  any  linear  trend  in  the  scatter  diagram  that  you 
constructed  in  the  second  large  data  set  problem  for  Section  10.1  "Linear 
Relationships  Between  Variables". 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal2.xls 

31.  Large  Data  Set  13  records  the  number  of  bidders  and  sales  price  of  a  particular 
type  of  antique  grandfather  clock  at  60  auctions.  Compute  the  linear 
correlation  coefficient  r.  Compare  its  value  to  your  comments  on  the 
appearance  and  strength  of  any  linear  trend  in  the  scatter  diagram  that  you 
constructed  in  the  third  large  data  set  problem  for  Section  10.1  "Linear 
Relationships  Between  Variables". 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal3.xls 
ANSWERS


1. r = 0.921

3. r = -0.794

5. r = 0.707

7. 0.875

9. -0.846

11. 0.948

13. 0.709

15. 0.832

17. 0.751

19. 0.965

21. 0.992

23. a. zero

    b. positive

    c. negative

    d. zero

    e. positive

25. same value

27. same value

29. r = 0.4601

31. r = 0.9002
10.3 Modelling Linear Relationships with Randomness Present


LEARNING  OBJECTIVE 

1.  To  learn  the  framework  in  which  the  statistical  analysis  of  the  linear 
relationship  between  two  variables  x  and  y  will  be  done. 


In this chapter we are dealing with a population for which we can associate to each
element two measurements, x and y. We are interested in situations in which the
value of x can be used to draw conclusions about the value of y, such as predicting
the resale value y of a residential house based on its size x. Since the relationship
between x and y is not deterministic, statistical procedures must be applied. For any
statistical procedures, given in this book or elsewhere, the associated formulas are
valid only under specific assumptions. The set of assumptions in simple linear
regression is a mathematical description of the relationship between x and y. Such
a set of assumptions is known as a model.


For each fixed value of x a sub-population of the full population is determined, such
as the collection of all houses with 2,100 square feet of living space. For each
element of that sub-population there is a measurement y, such as the value of any
2,100-square-foot house. Let E(y) denote the mean of all the y-values for each
particular value of x. E(y) can change from x-value to x-value, such as the mean
value of all 2,100-square-foot houses, the (different) mean value for all
2,500-square-foot houses, and so on.


Our first assumption is that the relationship between x and the mean of the y-values
in the sub-population determined by x is linear. This means that there exist
numbers $\beta_1$ and $\beta_0$ such that

$$E(y) = \beta_1 x + \beta_0$$

This linear relationship is the reason for the word “linear” in “simple linear
regression” below. (The word “simple” means that y depends on only one other
variable and not two or more.)


Our next assumption is that for each value of x the y-values scatter about the mean
E(y) according to a normal distribution centered at E(y) and with a standard
deviation $\sigma$ that is the same for every value of x. This is the same as saying that
there exists a normally distributed random variable $\varepsilon$ with mean 0 and standard
deviation $\sigma$ so that the relationship between x and y in the whole population is

$$y = \beta_1 x + \beta_0 + \varepsilon$$
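
To see why the two ways of stating this assumption agree, one can take the mean and the standard deviation of y at a fixed value of x, treating the deterministic quantity as a constant. The lines below are a short sketch of that reasoning rather than a formal proof; the notation SD for standard deviation is introduced here for convenience and is written with the Greek symbols of the model (slope, intercept, random term, and its standard deviation).

```latex
\begin{aligned}
E(y) &= E(\beta_1 x + \beta_0 + \varepsilon)
      = \beta_1 x + \beta_0 + E(\varepsilon)
      = \beta_1 x + \beta_0, \\
\mathrm{SD}(y) &= \mathrm{SD}(\beta_1 x + \beta_0 + \varepsilon)
      = \mathrm{SD}(\varepsilon)
      = \sigma .
\end{aligned}
```

Adding a constant shifts a normal distribution without changing its shape or spread, so at each fixed x the y-values are normal with the mean given by the line and the same standard deviation.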

Our  last  assumption  is  that  the  random  deviations  associated  with  different 
observations  are  independent. 


In  summary,  the  model  is: 


Simple  Linear  Regression  Model 

For each point (x, y) in the data set the y-value is an independent observation of

$$y = \beta_1 x + \beta_0 + \varepsilon$$

where $\beta_1$ and $\beta_0$ are fixed parameters and $\varepsilon$ is a normally distributed random
variable with mean 0 and an unknown standard deviation $\sigma$.

The line with equation $y = \beta_1 x + \beta_0$ is called the population regression
line².


Figure 10.5 "The Simple Linear Model Concept" illustrates the model. The symbols
$N(\mu, \sigma^2)$ denote a normal distribution with mean $\mu$ and variance $\sigma^2$, hence
standard deviation $\sigma$.


2. The line with equation
$y = \beta_1 x + \beta_0$ that gives the
mean of the variable y over the
sub-population determined by
x.

Figure 10.5 The Simple Linear Model Concept


It  is  conceptually  important  to  view  the  model  as  a  sum  of  two  parts: 


$$y = \beta_1 x + \beta_0 + \varepsilon$$


1. Deterministic Part. The first part $\beta_1 x + \beta_0$ is the equation that
describes the trend in y as x increases. The line that we seem to see
when we look at the scatter diagram is an approximation of the line
$y = \beta_1 x + \beta_0$. There is nothing random in this part, and therefore it is
called the deterministic part of the model.

2. Random Part. The second part $\varepsilon$ is a random variable, often called the
error term or the noise. This part explains why the actual observed
values of y are not exactly on but fluctuate near a line. Information
about this term is important since only when one knows how much
noise there is in the data can one know how trustworthy the detected
trend is.


There are three parameters in this model: $\beta_0$, $\beta_1$, and $\sigma$. Each has an important
interpretation, particularly $\beta_1$ and $\sigma$. The slope parameter $\beta_1$ represents the
expected change in y brought about by a unit increase in x. The standard deviation
$\sigma$ represents the magnitude of the noise in the data.
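
A small simulation can make the roles of the three parameters concrete. This is only a sketch: the parameter values (slope 2.0, intercept 5.0, noise standard deviation 1.5), the fixed x-value, the seed, and the function name `simulate_y` are all hypothetical choices made for illustration. It draws many y-values at one fixed x and compares their sample mean and standard deviation with what the model predicts.

```python
import random
import statistics


def simulate_y(x, beta1, beta0, sigma, rng):
    """Draw one observation y = beta1*x + beta0 + noise, with normal noise."""
    return beta1 * x + beta0 + rng.gauss(0.0, sigma)


rng = random.Random(0)                 # fixed seed so the run is repeatable
beta1, beta0, sigma = 2.0, 5.0, 1.5    # hypothetical parameter values
x = 3.0                                # one fixed value of x (one sub-population)

ys = [simulate_y(x, beta1, beta0, sigma, rng) for _ in range(20000)]
mean_y = statistics.fmean(ys)
sd_y = statistics.stdev(ys)

print(round(mean_y, 2), "vs model mean", beta1 * x + beta0)
print(round(sd_y, 2), "vs model standard deviation", sigma)
```

With this many draws the sample mean should land close to 11 and the sample standard deviation close to 1.5; repeating the run at a different x shifts the mean along the line but leaves the spread unchanged.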


There are procedures for checking the validity of the three assumptions, but for us
it will be sufficient to visually verify the linear trend in the data. If the data set is
large then the points in the scatter diagram will form a band about an apparent
straight line. The normality of $\varepsilon$ with a constant standard deviation corresponds
graphically to the band being of roughly constant width, and with most points
concentrated near the middle of the band.


Fortunately,  the  three  assumptions  do  not  need  to  hold  exactly  in  order  for  the 
procedures  and  analysis  developed  in  this  chapter  to  be  useful. 


KEY  TAKEAWAY 


•  Statistical  procedures  are  valid  only  when  certain  assumptions  are  valid. 
The  assumptions  underlying  the  analyses  done  in  this  chapter  are 
graphically  summarized  in  Figure  10.5  "The  Simple  Linear  Model 
Concept". 


EXERCISES 


1.  State  the  three  assumptions  that  are  the  basis  for  the  Simple  Linear  Regression 
Model. 


2.  The  Simple  Linear  Regression  Model  is  summarized  by  the  equation 

$$y = \beta_1 x + \beta_0 + \varepsilon$$

Identify  the  deterministic  part  and  the  random  part. 

3. Is the number $\beta_1$ in the equation $y = \beta_1 x + \beta_0$ a statistic or a population
parameter? Explain.

4. Is the number $\sigma$ in the Simple Linear Regression Model a statistic or a
population parameter? Explain.

5.  Describe  what  to  look  for  in  a  scatter  diagram  in  order  to  check  that  the 
assumptions  of  the  Simple  Linear  Regression  Model  are  true. 

6.  True  or  false:  the  assumptions  of  the  Simple  Linear  Regression  Model  must 
hold  exactly  in  order  for  the  procedures  and  analysis  developed  in  this  chapter 
to  be  useful. 
ANSWERS


1. a. The mean of y is linearly related to x.

b. For each given x, y is a normal random variable with mean $\beta_1 x + \beta_0$ and
standard deviation $\sigma$.

c. All the observations of y in the sample are independent.

3. $\beta_1$ is a population parameter.

5. A linear trend.
10.4 The Least Squares Regression Line


LEARNING  OBJECTIVES 

1.  To  learn  how  to  measure  how  well  a  straight  line  fits  a  collection  of  data. 

2.  To  learn  how  to  construct  the  least  squares  regression  line,  the  straight 
line  that  best  fits  a  collection  of  data. 

3.  To  learn  the  meaning  of  the  slope  of  the  least  squares  regression  line. 

4.  To  learn  how  to  use  the  least  squares  regression  line  to  estimate  the 
response  variable  y  in  terms  of  the  predictor  variable  x. 


Goodness  of  Fit  of  a  Straight  Line  to  Data 

Once  the  scatter  diagram  of  the  data  has  been  drawn  and  the  model  assumptions 
described  in  the  previous  sections  at  least  visually  verified  (and  perhaps  the 
correlation  coefficient  r  computed  to  quantitatively  verify  the  linear  trend),  the 
next step in the analysis is to find the straight line that best fits the data. We will explain how to measure how well a straight line fits a collection of points by examining how well the line $y = \frac{1}{2}x - 1$ fits the data set


x    2    2    6    8    10
y    0    1    2    3    3

(which will be used as a running example for the next three sections). We will write the equation of this line as $\hat{y} = \frac{1}{2}x - 1$ with an accent on the y to indicate that the y-values computed using this equation are not from the data. We will do this with all lines approximating data sets. The line $\hat{y} = \frac{1}{2}x - 1$ was selected as one that seems to fit the data reasonably well.


The idea for measuring the goodness of fit of a straight line to data is illustrated in Figure 10.6 "Plot of the Five-Point Data and the Line $\hat{y} = \frac{1}{2}x - 1$", in which the graph of the line $\hat{y} = \frac{1}{2}x - 1$ has been superimposed on the scatter plot for the sample data set.

Figure 10.6 Plot of the Five-Point Data and the Line $\hat{y} = \frac{1}{2}x - 1$


To each point in the data set there is associated an “error,” the positive or negative vertical distance from the point to the line: positive if the point is above the line and negative if it is below the line. The error can be computed as the actual y-value of the point minus the y-value $\hat{y}$ that is “predicted” by inserting the x-value of the data point into the formula for the line:

error at data point (x, y) = (true y) - (predicted y) = $y - \hat{y}$

The  computation  of  the  error  for  each  of  the  five  points  in  the  data  set  is  shown  in 
Table  10.1  "The  Errors  in  Fitting  Data  with  a  Straight  Line". 


Table 10.1 The Errors in Fitting Data with a Straight Line

x     y     $\hat{y} = \frac{1}{2}x - 1$     $y - \hat{y}$     $(y - \hat{y})^2$
2     0     0     0      0
2     1     0     1      1
6     2     2     0      0
8     3     3     0      0
10    3     4     -1     1
$\Sigma$     -     -     0      2

Error: $y - \hat{y}$, the actual y-value of a data point minus the y-value that is computed from the equation of the line fitting the data.
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A  first  thought  for  a  measure  of  the  goodness  of  fit  of  the  line  to  the  data  would  be 
simply  to  add  the  errors  at  every  point,  but  the  example  shows  that  this  cannot 
work  well  in  general.  The  line  does  not  fit  the  data  perfectly  (no  line  can),  yet 
because  of  cancellation  of  positive  and  negative  errors  the  sum  of  the  errors  (the 
fourth  column  of  numbers)  is  zero.  Instead  goodness  of  fit  is  measured  by  the  sum 
of  the  squares  of  the  errors.  Squaring  eliminates  the  minus  signs,  so  no  cancellation 
can occur. For the data and line in Figure 10.6 "Plot of the Five-Point Data and the Line $\hat{y} = \frac{1}{2}x - 1$" the sum of the squared errors (the last column of numbers) is 2. This number measures the goodness of fit of the line to the data.


Definition 

The goodness of fit of a line $\hat{y} = mx + b$ to a set of n pairs (x, y) of numbers in a sample is the sum of the squared errors

$$\Sigma (y - \hat{y})^2$$

(n terms in the sum, one for each data pair).
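The definition can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `sum_squared_errors` is our own choice. It computes the individual errors and their sum of squares for the five-point data set and the line with slope 1/2 and y-intercept -1.

```python
def sum_squared_errors(xs, ys, slope, intercept):
    """Return the errors y - y_hat and the sum of their squares for a given line."""
    errors = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return errors, sum(e * e for e in errors)


# The five-point data set used as the running example.
xs = [2, 2, 6, 8, 10]
ys = [0, 1, 2, 3, 3]

errors, sse = sum_squared_errors(xs, ys, slope=0.5, intercept=-1)
print(errors)       # [0.0, 1.0, 0.0, 0.0, -1.0]
print(sum(errors))  # 0.0  (positive and negative errors cancel)
print(sse)          # 2.0  (squaring prevents the cancellation)
```

The printed values agree with the last two columns of Table 10.1: the errors sum to zero even though the fit is not perfect, while the squared errors sum to 2.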


The  Least  Squares  Regression  Line 

Given  any  collection  of  pairs  of  numbers  (except  when  all  the  x-values  are  the  same) 
and  the  corresponding  scatter  diagram,  there  always  exists  exactly  one  straight  line 
that  fits  the  data  better  than  any  other,  in  the  sense  of  minimizing  the  sum  of  the 
squared  errors.  It  is  called  the  least  squares  regression  line.  Moreover  there  are 
formulas for its slope and y-intercept.
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Definition 


Given a collection of pairs (x, y) of numbers (in which not all the x-values are the same), there is a line $\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0$ that best fits the data in the sense of minimizing the sum of the squared errors. It is called the least squares regression line. Its slope $\hat{\beta}_1$ and y-intercept $\hat{\beta}_0$ are computed using the formulas

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} \quad\text{and}\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where

$$SS_{xx} = \Sigma x^2 - \frac{1}{n}(\Sigma x)^2, \qquad SS_{xy} = \Sigma xy - \frac{1}{n}(\Sigma x)(\Sigma y),$$

$\bar{x}$ is the mean of all the x-values, $\bar{y}$ is the mean of all the y-values, and n is the number of pairs in the data set.


The equation $\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0$ specifying the least squares regression line is called the least squares regression equation.


Remember from Section 10.3 "Modelling Linear Relationships with Randomness Present" that the line with the equation $y = \beta_1 x + \beta_0$ is called the population regression line. The numbers $\hat{\beta}_1$ and $\hat{\beta}_0$ are statistics that estimate the population parameters $\beta_1$ and $\beta_0$.
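The formulas for the slope and y-intercept translate directly into code. The sketch below is an illustration only, not part of the original text; the function name `least_squares_line` and its return convention are assumptions of ours. It mirrors the formulas for $SS_{xx}$ and $SS_{xy}$ term by term and refuses data in which all the x-values are the same.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # SS_xx = sum of x^2 minus (1/n)(sum of x)^2
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    # SS_xy = sum of xy minus (1/n)(sum of x)(sum of y)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if ss_xx == 0:
        raise ValueError("all x-values are the same; no unique line exists")
    slope = ss_xy / ss_xx
    intercept = sum_y / n - slope * (sum_x / n)
    return slope, intercept


# The five-point running example.
slope, intercept = least_squares_line([2, 2, 6, 8, 10], [0, 1, 2, 3, 3])
print(round(slope, 5), round(intercept, 3))  # approximately 0.34375 and -0.125
```

In practice a statistical package would be used instead; the point of the sketch is only to show that nothing beyond the five sums in the formulas is needed.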


We  will  compute  the  least  squares  regression  line  for  the  five-point  data  set,  then 
for  a  more  practical  example  that  will  be  another  running  example  for  the 
introduction  of  new  concepts  in  this  and  the  next  three  sections. 


4.  The  line  that  best  fits  a  set  of 
sample  data  in  the  sense  of 
minimizing  the  sum  of  the 
squared  errors. 

5. The equation $\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0$ of the least squares regression line.
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EXAMPLE  2 


Find the least squares regression line for the five-point data set

x    2    2    6    8    10
y    0    1    2    3    3

and verify that it fits the data better than the line $\hat{y} = \frac{1}{2}x - 1$ considered in Section 10.4.1 "Goodness of Fit of a Straight Line to Data".

Solution: 

In  actual  practice  computation  of  the  regression  line  is  done  using  a 
statistical  computation  package.  In  order  to  clarify  the  meaning  of  the 
formulas  we  display  the  computations  in  tabular  form. 


x     y     $x^2$     $xy$
2     0     4       0
2     1     4       2
6     2     36      12
8     3     64      24
10    3     100     30
$\Sigma$     28    9     208     68

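In the last line of the table we have the sum of the numbers in each column. Using them we compute:

$$SS_{xx} = \Sigma x^2 - \frac{1}{n}(\Sigma x)^2 = 208 - \frac{1}{5}(28)^2 = 51.2$$

$$SS_{xy} = \Sigma xy - \frac{1}{n}(\Sigma x)(\Sigma y) = 68 - \frac{1}{5}(28)(9) = 17.6$$

$$\bar{x} = \frac{\Sigma x}{n} = \frac{28}{5} = 5.6$$

$$\bar{y} = \frac{\Sigma y}{n} = \frac{9}{5} = 1.8$$

so that

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{17.6}{51.2} = 0.34375 \quad\text{and}\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 1.8 - (0.34375)(5.6) = -0.125$$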


The least squares regression line for these data is

$$\hat{y} = 0.34375x - 0.125$$

The computations for measuring how well it fits the sample data are given in Table 10.2 "The Errors in Fitting Data with the Least Squares Regression Line". The sum of the squared errors is the sum of the numbers in the last column, which is 0.75. It is less than 2, the sum of the squared errors for the fit of the line $\hat{y} = \frac{1}{2}x - 1$ to this data set.


TABLE 10.2 THE ERRORS IN FITTING DATA WITH THE LEAST SQUARES REGRESSION LINE

x     y     $\hat{y} = 0.34375x - 0.125$     $y - \hat{y}$     $(y - \hat{y})^2$
2     0     0.5625     -0.5625     0.31640625
2     1     0.5625     0.4375      0.19140625
6     2     1.9375     0.0625      0.00390625
8     3     2.6250     0.3750      0.14062500
10    3     3.3125     -0.3125     0.09765625
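The comparison made in this example can be reproduced in a few lines. This is a hedged sketch rather than part of the original solution; the helper name `sse` is ours. It evaluates the sum of squared errors for the trial line and for the least squares line, and also for a slightly perturbed line to illustrate that moving away from the least squares line increases the sum.

```python
xs = [2, 2, 6, 8, 10]
ys = [0, 1, 2, 3, 3]


def sse(slope, intercept):
    """Sum of squared errors of the line y_hat = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


sse_trial = sse(0.5, -1)                  # the line considered first
sse_least_squares = sse(0.34375, -0.125)  # the least squares regression line
sse_perturbed = sse(0.35, -0.125)         # same intercept, slightly different slope

print(sse_trial, sse_least_squares, sse_perturbed)
```

The least squares value is the smallest of the three, in agreement with the defining property of the line.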
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EXAMPLE  3 


Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model" shows the age in years and the retail value in thousands of dollars of a random sample of ten automobiles of the same make and model.

a.  Construct  the  scatter  diagram. 

b.  Compute  the  linear  correlation  coefficient  r.  Interpret  its  value  in  the 
context  of  the  problem. 

c.  Compute  the  least  squares  regression  line.  Plot  it  on  the  scatter  diagram. 

d.  Interpret  the  meaning  of  the  slope  of  the  least  squares  regression  line  in 
the  context  of  the  problem. 

e.  Suppose  a  four-year-old  automobile  of  this  make  and  model  is  selected 
at  random.  Use  the  regression  equation  to  predict  its  retail  value. 

f.  Suppose  a  20-year-old  automobile  of  this  make  and  model  is  selected  at 
random.  Use  the  regression  equation  to  predict  its  retail  value.  Interpret 
the  result. 

g.  Comment  on  the  validity  of  using  the  regression  equation  to  predict  the 
price  of  a  brand  new  automobile  of  this  make  and  model. 


TABLE 10.3 DATA ON AGE AND VALUE OF USED AUTOMOBILES OF A SPECIFIC MAKE AND MODEL

x    2       3       3       3       4       4       5       5       5       6
y    28.7    24.8    26.0    30.5    23.8    24.6    23.8    20.4    21.6    22.1

Solution: 

a. The scatter diagram is shown in Figure 10.7 "Scatter Diagram for Age and Value of Used Automobiles".

Figure 10.7 Scatter Diagram for Age and Value of Used Automobiles


b. We must first compute $SS_{xx}$, $SS_{xy}$, $SS_{yy}$, which means computing $\Sigma x$, $\Sigma y$, $\Sigma x^2$, $\Sigma y^2$, and $\Sigma xy$. Using a computing device we obtain

$$\Sigma x = 40 \quad \Sigma y = 246.3 \quad \Sigma x^2 = 174 \quad \Sigma y^2 = 6154.15 \quad \Sigma xy = 956.5$$

Thus

$$SS_{xx} = \Sigma x^2 - \frac{1}{n}(\Sigma x)^2 = 174 - \frac{1}{10}(40)^2 = 14$$

$$SS_{xy} = \Sigma xy - \frac{1}{n}(\Sigma x)(\Sigma y) = 956.5 - \frac{1}{10}(40)(246.3) = -28.7$$

$$SS_{yy} = \Sigma y^2 - \frac{1}{n}(\Sigma y)^2 = 6154.15 - \frac{1}{10}(246.3)^2 = 87.781$$

so that

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{-28.7}{\sqrt{(14)(87.781)}} = -0.819$$

The age and value of this make and model automobile are moderately strongly negatively correlated. As the age increases, the value of the automobile tends to decrease.


c. Using the values of $\Sigma x$ and $\Sigma y$ computed in part (b),

$$\bar{x} = \frac{\Sigma x}{n} = \frac{40}{10} = 4 \quad\text{and}\quad \bar{y} = \frac{\Sigma y}{n} = \frac{246.3}{10} = 24.63$$

Thus using the values of $SS_{xx}$ and $SS_{xy}$ from part (b),

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-28.7}{14} = -2.05 \quad\text{and}\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 24.63 - (-2.05)(4) = 32.83$$

The equation $\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0$ of the least squares regression line for these sample data is

$$\hat{y} = -2.05x + 32.83$$

Figure 10.8 "Scatter Diagram and Regression Line for Age and Value of Used Automobiles" shows the scatter diagram with the graph of the least squares regression line superimposed.
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Figure  10.8 

Scatter  Diagram  and  Regression  Line  for  Age  and  Value  of  Used  Automobiles 


d. The slope -2.05 means that for each unit increase in x (additional year of age) the average value of this make and model vehicle decreases by about 2.05 units (about $2,050).

e. Since we know nothing about the automobile other than its age, we assume that it is of about average value and use the average value of all four-year-old vehicles of this make and model as our estimate. The average value is simply the value of $\hat{y}$ obtained when the number 4 is inserted for x in the least squares regression equation:

$$\hat{y} = -2.05(4) + 32.83 = 24.63$$

which corresponds to $24,630.

f. Now we insert x = 20 into the least squares regression equation, to obtain

$$\hat{y} = -2.05(20) + 32.83 = -8.17$$

which corresponds to -$8,170. Something is wrong here, since a negative value makes no sense. The error arose from applying the regression equation to a value of x not in the range of x-values in the original data, from two to six years.

Applying the regression equation $\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0$ to a value of x outside the range of x-values in the data set is called extrapolation. It is an invalid use of the regression equation and should be avoided.

g. The price of a brand new vehicle of this make and model is the value of the automobile at age 0. If the value x = 0 is inserted into the regression equation the result is always $\hat{\beta}_0$, the y-intercept, in this case 32.83, which corresponds to $32,830. But this is a case of extrapolation, just as part (f) was, hence this result is invalid, although not obviously so. In the context of the problem, since automobiles tend to lose value much more quickly immediately after they are purchased than they do after they are several years old, the number $32,830 is probably an underestimate of the price of a new automobile of this make and model.
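Parts (b), (c), and (e) of the example can be reproduced from the raw data in Table 10.3. The following Python sketch is an illustration under our own naming (the variables `ages` and `values` are not from the text); it is not the computing device the solution refers to, only one possible way to obtain the same numbers.

```python
import math

# Table 10.3: age in years and retail value in thousands of dollars.
ages = [2, 3, 3, 3, 4, 4, 5, 5, 5, 6]
values = [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1]

n = len(ages)
sum_x, sum_y = sum(ages), sum(values)
ss_xx = sum(x * x for x in ages) - sum_x ** 2 / n
ss_xy = sum(x * y for x, y in zip(ages, values)) - sum_x * sum_y / n
ss_yy = sum(y * y for y in values) - sum_y ** 2 / n

# Part (b): linear correlation coefficient.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

# Part (c): least squares regression line.
slope = ss_xy / ss_xx
intercept = sum_y / n - slope * (sum_x / n)

# Part (e): predicted value of a four-year-old automobile (x = 4 is inside the data range).
value_at_4 = slope * 4 + intercept

print(round(r, 3), round(slope, 2), round(intercept, 2), round(value_at_4, 2))
```

Up to rounding, the results match the values -0.819, -2.05, 32.83, and 24.63 obtained in the solution.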


For  emphasis  we  highlight  the  points  raised  by  parts  (f)  and  (g)  of  the  example. 


Definition 

The process of using the least squares regression equation to estimate the value of y at a value of x that does not lie in the range of the x-values in the data set that was used to form the regression line is called extrapolation. It is an invalid use of the regression equation that can lead to errors, hence should be avoided.
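One way to guard against accidental extrapolation in software is to record the range of x-values in the data and refuse predictions outside it. The sketch below is only an illustration; the function `predict` and its parameters `x_min` and `x_max` are hypothetical names of ours, and raising an error is just one possible design (a warning would be another).

```python
def predict(x_new, slope, intercept, x_min, x_max):
    """Evaluate the regression equation, refusing x-values outside the data range."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} lies outside the data range [{x_min}, {x_max}]: extrapolation"
        )
    return slope * x_new + intercept


# Regression line for the automobile data; the ages in the sample run from 2 to 6 years.
print(round(predict(4, -2.05, 32.83, x_min=2, x_max=6), 2))  # 24.63

try:
    predict(20, -2.05, 32.83, x_min=2, x_max=6)
except ValueError as err:
    print(err)  # the 20-year-old automobile is rejected instead of valued at a negative price
```

The same guard rejects x = 0, the brand new automobile of part (g), even though the extrapolated value there does not look obviously wrong.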


The  Sum  of  the  Squared  Errors  SSE 

In  general,  in  order  to  measure  the  goodness  of  fit  of  a  line  to  a  set  of  data,  we  must 
compute the predicted y-value $\hat{y}$ at every point in the data set, compute each error,
square  it,  and  then  add  up  all  the  squares.  In  the  case  of  the  least  squares  regression 
line,  however,  the  line  that  best  fits  the  data,  the  sum  of  the  squared  errors  can  be 
computed  directly  from  the  data  using  the  following  formula. 


6. The process of using the least squares regression equation to estimate the value of y at an x value not in the proper range.
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The  sum  of  the  squared  errors  for  the  least  squares  regression  line  is  denoted 
by  SSE.  It  can  be  computed  using  the  formula 


$$SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$$



EXAMPLE  4 


Find the sum of the squared errors SSE for the least squares regression line for the five-point data set

x    2    2    6    8    10
y    0    1    2    3    3

Do so in two ways:

a. using the definition $\Sigma (y - \hat{y})^2$;

b. using the formula $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$.

Solution: 

a. The least squares regression line was computed in Note 10.18 "Example 2" and is $\hat{y} = 0.34375x - 0.125$. SSE was found at the end of that example using the definition $\Sigma (y - \hat{y})^2$. The computations were tabulated in Table 10.2 "The Errors in Fitting Data with the Least Squares Regression Line". SSE is the sum of the numbers in the last column, which is 0.75.

b. The numbers $SS_{xy}$ and $\hat{\beta}_1$ were already computed in Note 10.18 "Example 2" in the process of finding the least squares regression line. So was the number $\Sigma y = 9$. We must compute $SS_{yy}$. To do so it is necessary to first compute

$$\Sigma y^2 = 0^2 + 1^2 + 2^2 + 3^2 + 3^2 = 23$$

Then

$$SS_{yy} = \Sigma y^2 - \frac{1}{n}(\Sigma y)^2 = 23 - \frac{1}{5}(9)^2 = 6.8$$

so that

$$SSE = SS_{yy} - \hat{\beta}_1 SS_{xy} = 6.8 - (0.34375)(17.6) = 0.75$$
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EXAMPLE  5 


Find  the  sum  of  the  squared  errors  SSE  for  the  least  squares  regression 
line  for  the  data  set,  presented  in  Table  10.3  "Data  on  Age  and  Value  of  Used 
Automobiles  of  a  Specific  Make  and  Model",  on  age  and  values  of  used 
vehicles  in  Note  10.19  "Example  3". 

Solution: 

From Note 10.19 "Example 3" we already know that

$$SS_{xy} = -28.7, \quad \hat{\beta}_1 = -2.05, \quad\text{and}\quad \Sigma y = 246.3$$

To compute $SS_{yy}$ we first compute

$$\Sigma y^2 = 28.7^2 + 24.8^2 + 26.0^2 + 30.5^2 + 23.8^2 + 24.6^2 + 23.8^2 + 20.4^2 + 21.6^2 + 22.1^2 = 6154.15$$

Then

$$SS_{yy} = \Sigma y^2 - \frac{1}{n}(\Sigma y)^2 = 6154.15 - \frac{1}{10}(246.3)^2 = 87.781$$

Therefore

$$SSE = SS_{yy} - \hat{\beta}_1 SS_{xy} = 87.781 - (-2.05)(-28.7) = 28.946$$
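That the shortcut formula and the definition give the same number can be confirmed numerically for both running examples. The sketch below is ours, not part of the text; the function name `least_squares_sse` is an assumption. It returns SSE computed both ways so that the two values can be compared.

```python
def least_squares_sse(xs, ys):
    """Return SSE of the least squares line, by definition and by the shortcut formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    slope = ss_xy / ss_xx
    intercept = sum_y / n - slope * (sum_x / n)
    # Definition: add up the squared errors point by point.
    by_definition = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Shortcut: SSE = SS_yy - slope * SS_xy, valid only for the least squares line.
    by_formula = ss_yy - slope * ss_xy
    return by_definition, by_formula


five_point = least_squares_sse([2, 2, 6, 8, 10], [0, 1, 2, 3, 3])
autos = least_squares_sse(
    [2, 3, 3, 3, 4, 4, 5, 5, 5, 6],
    [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1],
)
print([round(v, 3) for v in five_point])  # both approximately 0.75
print([round(v, 3) for v in autos])       # both approximately 28.946
```

Note that the shortcut applies only to the least squares regression line; for any other line the squared errors must be added up directly.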


KEY  TAKEAWAYS 


•  How  well  a  straight  line  fits  a  data  set  is  measured  by  the  sum  of  the 
squared  errors. 

•  The least squares regression line is the line that best fits the data. Its slope and y-intercept are computed from the data using formulas.

•  The slope $\hat{\beta}_1$ of the least squares regression line estimates the size and direction of the mean change in the dependent variable y when the independent variable x is increased by one unit.

•  The  sum  of  the  squared  errors  SSE  of  the  least  squares  regression  line 
can  be  computed  using  a  formula,  without  having  to  compute  all  the 
individual  errors. 
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For  the  Basic  and  Application  exercises  in  this  section  use  the  computations 
that  were  done  for  the  exercises  with  the  same  number  in  Section  10.2  "The 
Linear  Correlation  Coefficient". 


1. Compute the least squares regression line for the data in Exercise 1 of Section 10.2 "The Linear Correlation Coefficient".

2. Compute the least squares regression line for the data in Exercise 2 of Section 10.2 "The Linear Correlation Coefficient".

3. Compute the least squares regression line for the data in Exercise 3 of Section 10.2 "The Linear Correlation Coefficient".

4. Compute the least squares regression line for the data in Exercise 4 of Section 10.2 "The Linear Correlation Coefficient".

5. For the data in Exercise 5 of Section 10.2 "The Linear Correlation Coefficient"

a. Compute the least squares regression line.

b. Compute the sum of the squared errors SSE using the definition $\Sigma (y - \hat{y})^2$.

c. Compute the sum of the squared errors SSE using the formula $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$.

6. For the data in Exercise 6 of Section 10.2 "The Linear Correlation Coefficient"

a. Compute the least squares regression line.

b. Compute the sum of the squared errors SSE using the definition $\Sigma (y - \hat{y})^2$.

c. Compute the sum of the squared errors SSE using the formula $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$.

7. Compute the least squares regression line for the data in Exercise 7 of Section 10.2 "The Linear Correlation Coefficient".

8. Compute the least squares regression line for the data in Exercise 8 of Section 10.2 "The Linear Correlation Coefficient".

9. For the data in Exercise 9 of Section 10.2 "The Linear Correlation Coefficient"

a. Compute the least squares regression line.

b. Can you compute the sum of the squared errors SSE using the definition $\Sigma (y - \hat{y})^2$? Explain.

c. Compute the sum of the squared errors SSE using the formula $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$.

10. For the data in Exercise 10 of Section 10.2 "The Linear Correlation Coefficient"

a. Compute the least squares regression line.

b. Can you compute the sum of the squared errors SSE using the definition $\Sigma (y - \hat{y})^2$? Explain.

c. Compute the sum of the squared errors SSE using the formula $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$.


APPLICATIONS 


11.  For  the  data  in  Exercise  11  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b.  On  average,  how  many  new  words  does  a  child  from  13  to  18  months  old 
learn  each  month?  Explain. 

c.  Estimate  the  average  vocabulary  of  all  16-month-old  children. 

12.  For  the  data  in  Exercise  12  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b.  On  average,  how  many  additional  feet  are  added  to  the  braking  distance 
for  each  additional  100  pounds  of  weight?  Explain. 

c.  Estimate  the  average  braking  distance  of  all  cars  weighing  3,000  pounds. 

13.  For  the  data  in  Exercise  13  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b.  Estimate  the  average  resting  heart  rate  of  all  40-year-old  men. 

c. Estimate the average resting heart rate of all newborn baby boys. Comment on the validity of the estimate.

14.  For  the  data  in  Exercise  14  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b.  Estimate  the  average  wave  height  when  the  wind  is  blowing  at  10  miles  per 
hour. 

c.  Estimate  the  average  wave  height  when  there  is  no  wind  blowing. 
Comment  on  the  validity  of  the  estimate. 

15.  For  the  data  in  Exercise  15  of  Section  10.2  "The  Linear  Correlation  Coefficient" 


a.  Compute  the  least  squares  regression  line. 

b.  On  average,  for  each  additional  thousand  dollars  spent  on  advertising,  how 
does  revenue  change?  Explain. 

c.  Estimate  the  revenue  if  $2,500  is  spent  on  advertising  next  year. 

16.  For  the  data  in  Exercise  16  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b. On average, for each additional inch of height of a two-year-old girl, what is the change in the adult height? Explain.

c.  Predict  the  adult  height  of  a  two-year-old  girl  who  is  33  inches  tall. 

17.  For  the  data  in  Exercise  17  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b. Compute SSE using the formula $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$.

c.  Estimate  the  average  final  exam  score  of  all  students  whose  course  average 
just  before  the  exam  is  85. 

18.  For  the  data  in  Exercise  18  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b. Compute SSE using the formula $SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$.

c.  Estimate  the  number  of  acres  that  would  be  harvested  if  90  million  acres  of 
corn  were  planted. 

19.  For  the  data  in  Exercise  19  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b.  Interpret  the  value  of  the  slope  of  the  least  squares  regression  line  in  the 
context  of  the  problem. 

c.  Estimate  the  average  concentration  of  the  active  ingredient  in  the  blood  in 
men  after  consuming  1  ounce  of  the  medication. 

20.  For  the  data  in  Exercise  20  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b.  Interpret  the  value  of  the  slope  of  the  least  squares  regression  line  in  the 
context  of  the  problem. 

c.  Estimate  the  age  of  an  oak  tree  whose  girth  five  feet  off  the  ground  is  92 
inches. 

21.  For  the  data  in  Exercise  21  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 
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b.  The  28-day  strength  of  concrete  used  on  a  certain  job  must  be  at  least  3,200 
psi.  If  the  3-day  strength  is  1,300  psi,  would  we  anticipate  that  the  concrete 
will  be  sufficiently  strong  on  the  28th  day?  Explain  fully. 

22.  For  the  data  in  Exercise  22  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Compute  the  least  squares  regression  line. 

b. If the power facility is called upon to provide more than 95 million watt-hours tomorrow then energy will have to be purchased from elsewhere at
a  premium.  The  forecast  is  for  an  average  temperature  of  42  degrees. 
Should  the  company  plan  on  purchasing  power  at  a  premium? 


ADDITIONAL  EXERCISES 


23.  Verify  that  no  matter  what  the  data  are,  the  least  squares  regression  line 
always passes through the point with coordinates $(\bar{x}, \bar{y})$. Hint: Find the predicted value of y when $x = \bar{x}$.

24.  In  Exercise  1  you  computed  the  least  squares  regression  line  for  the  data  in 
Exercise  1  of  Section  10.2  "The  Linear  Correlation  Coefficient". 


a. Reverse the roles of x and y and compute the least squares regression line for the new data set

x    2    4    6    5    9
y    0    1    3    5    8

b. Interchanging x and y corresponds geometrically to reflecting the scatter plot in a 45-degree line. Reflecting the regression line for the original data the same way gives a line with the equation $\hat{y} = 1.346x - 3.600$. Is this the equation that you got in part (a)? Can you figure out why not? Hint: Think about how x and y are treated differently geometrically in the computation of the goodness of fit.

c.  Compute  SSE  for  each  line  and  see  if  they  fit  the  same,  or  if  one  fits  the 
data  better  than  the  other. 


LARGE  DATA  SET  EXERCISES 


25.  Large  Data  Set  1  lists  the  SAT  scores  and  GPAs  of  1,000  students. 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 
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a.  Compute  the  least  squares  regression  line  with  SAT  score  as  the 
independent  variable  (x)  and  GPA  as  the  dependent  variable  (y). 

b. Interpret the meaning of the slope $\hat{\beta}_1$ of the regression line in the context of the problem.

c.  Compute  SSE  ,  the  measure  of  the  goodness  of  fit  of  the  regression  line  to 
the  sample  data. 

d.  Estimate  the  GPA  of  a  student  whose  SAT  score  is  1350. 

26.  Large  Data  Set  12  lists  the  golf  scores  on  one  round  of  golf  for  75  golfers  first 
using  their  own  original  clubs,  then  using  clubs  of  a  new,  experimental  design 
(after  two  months  of  familiarization  with  the  new  clubs). 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal2.xls 

a.  Compute  the  least  squares  regression  line  with  scores  using  the  original 
clubs  as  the  independent  variable  (x)  and  scores  using  the  new  clubs  as  the 
dependent  variable  (y). 

b. Interpret the meaning of the slope $\hat{\beta}_1$ of the regression line in the context of the problem.

c.  Compute  SSE  ,  the  measure  of  the  goodness  of  fit  of  the  regression  line  to 
the  sample  data. 

d.  Estimate  the  score  with  the  new  clubs  of  a  golfer  whose  score  with  the  old 
clubs  is  73. 

27.  Large  Data  Set  13  records  the  number  of  bidders  and  sales  price  of  a  particular 
type  of  antique  grandfather  clock  at  60  auctions. 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal3.xls 

a.  Compute  the  least  squares  regression  line  with  the  number  of  bidders 
present  at  the  auction  as  the  independent  variable  (x)  and  sales  price  as 
the  dependent  variable  (y). 

b. Interpret the meaning of the slope $\hat{\beta}_1$ of the regression line in the context of the problem.

c.  Compute  SSE  ,  the  measure  of  the  goodness  of  fit  of  the  regression  line  to 
the  sample  data. 

d.  Estimate  the  sales  price  of  a  clock  at  an  auction  at  which  the  number  of 
bidders  is  seven. 
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ANSWERS 


1. $\hat{y} = 0.743x + 2.675$

3. $\hat{y} = -0.610x + 4.082$

5. $\hat{y} = 0.625x + 1.25$, SSE = 5

7. $\hat{y} = 0.6x + 1.8$

9. $\hat{y} = -1.45x + 2.4$, SSE = 50.25 (cannot use the definition to compute)

11. a. $\hat{y} = 4.848x - 56$,

b. 4.8,

c. 21.6

13. a. $\hat{y} = 0.114x + 69.222$,

b. 73.8,

c. 69.2, invalid extrapolation

15. a. $\hat{y} = 42.024x + 119.502$,

b. increases by $42,024,

c. $224,562

17. a. $\hat{y} = 1.045x - 8.527$,

b. 2151.93367,

c. 80.3

19. a. $\hat{y} = 0.043x + 0.001$,

b. For each additional ounce of medication consumed blood concentration of the active ingredient increases by 0.043%,

c. 0.044%

21. a. $\hat{y} = 2.550x + 1.993$,

b. Predicted 28-day strength is 3,514 psi; sufficiently strong

25. a. $\hat{y} = 0.0016x + 0.022$

b. On average, every 100 point increase in SAT score adds 0.16 point to the GPA.

c. SSE = 432.10

d. $\hat{y} = 2.182$

27. a. $\hat{y} = 116.62x + 6955.1$

b. On average, every 1 additional bidder at an auction raises the price by 116.62 dollars.

c. SSE = 1850314.08

d. $\hat{y} = 7771.44$

10.5 Statistical Inferences About $\beta_1$


LEARNING  OBJECTIVES 

1. To learn how to construct a confidence interval for $\beta_1$, the slope of the population regression line.

2. To learn how to test hypotheses regarding $\beta_1$.


The parameter $\beta_1$, the slope of the population regression line, is of primary importance in regression analysis because it gives the true rate of change in the mean E(y) in response to a unit increase in the predictor variable x. For every unit increase in x the mean of the response variable y changes by $\beta_1$ units, increasing if $\beta_1 > 0$ and decreasing if $\beta_1 < 0$. We wish to construct confidence intervals for $\beta_1$ and test hypotheses about it.

Confidence Intervals for $\beta_1$


The slope $\hat{\beta}_1$ of the least squares regression line is a point estimate of $\beta_1$. A confidence interval for $\beta_1$ is given by the following formula.


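$100(1 - \alpha)\%$ Confidence Interval for the Slope $\beta_1$ of the Population Regression Line

$$\hat{\beta}_1 \pm t_{\alpha/2} \frac{s_\varepsilon}{\sqrt{SS_{xx}}}$$

where $s_\varepsilon = \sqrt{\dfrac{SSE}{n-2}}$ and the number of degrees of freedom is $df = n - 2$.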


The  assumptions  listed  in  Section  10.3  "Modelling  Linear  Relationships  with 
Randomness  Present"  must  hold. 
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Definition 

The statistic $s_\varepsilon$ is called the sample standard deviation of errors. It estimates the standard deviation $\sigma$ of the errors in the population of y-values for each fixed value of x (see Figure 10.5 "The Simple Linear Model Concept" in Section 10.3 "Modelling Linear Relationships with Randomness Present").


7. The statistic $s_\varepsilon$.
EXAMPLE 6


Construct the 95% confidence interval for the slope $\beta_1$ of the population regression line based on the five-point sample data set

x    2    2    6    8    10
y    0    1    2    3    3

Solution:

The point estimate $\hat{\beta}_1$ of $\beta_1$ was computed in Note 10.18 "Example 2" in Section 10.4 "The Least Squares Regression Line" as $\hat{\beta}_1 = 0.34375$. In the same example $SS_{xx}$ was found to be $SS_{xx} = 51.2$. The sum of the squared errors SSE was computed in Note 10.23 "Example 4" in Section 10.4 "The Least Squares Regression Line" as SSE = 0.75. Thus

$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{0.75}{3}} = 0.50$$

Confidence level 95% means $\alpha = 1 - 0.95 = 0.05$ so $\alpha/2 = 0.025$. From the row labeled df = 3 in Figure 12.3 "Critical Values of t" we obtain $t_{0.025} = 3.182$. Therefore

$$\hat{\beta}_1 \pm t_{\alpha/2} \frac{s_\varepsilon}{\sqrt{SS_{xx}}} = 0.34375 \pm 3.182 \left( \frac{0.50}{\sqrt{51.2}} \right) = 0.34375 \pm 0.2223$$

which gives the interval (0.1215, 0.5661). We are 95% confident that the slope $\beta_1$ of the population regression line is between 0.1215 and 0.5661.
EXAMPLE 7


Using the sample data in Table 10.3 "Data on Age and Value of Used
Automobiles of a Specific Make and Model" construct a 90% confidence
interval for the slope \(\beta_1\) of the population regression line relating age and
value of the automobiles of Note 10.19 "Example 3" in Section 10.4 "The
Least Squares Regression Line". Interpret the result in the context of the
problem.

Solution:


The point estimate \(\hat{\beta}_1\) of \(\beta_1\) was computed in Note 10.19 "Example 3", as
was \(SS_{xx}\). Their values are \(\hat{\beta}_1 = -2.05\) and \(SS_{xx} = 14\). The sum of
the squared errors SSE was computed in Note 10.24 "Example 5" in Section
10.4 "The Least Squares Regression Line" as \(SSE = 28.946\). Thus

\[
s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{28.946}{8}} = 1.902169814
\]

Confidence level 90% means \(\alpha = 1 - 0.90 = 0.10\) so \(\alpha/2 = 0.05\).
From the row labeled \(df = 8\) in Figure 12.3 "Critical Values of t" we obtain
\(t_{0.05} = 1.860\). Therefore

\[
\hat{\beta}_1 \pm t_{\alpha/2} \, \frac{s_\varepsilon}{\sqrt{SS_{xx}}}
= -2.05 \pm 1.860 \left( \frac{1.902169814}{\sqrt{14}} \right)
= -2.05 \pm 0.95
\]

which gives the interval (-3.00, -1.10). We are 90% confident that the
slope \(\beta_1\) of the population regression line is between -3.00 and -1.10. In the
context of the problem this means that for vehicles of this make and model
between two and six years old we are 90% confident that for each additional
year of age the average value of such a vehicle decreases by between $1,100
and $3,000.


Testing Hypotheses About \(\beta_1\)

Hypotheses regarding \(\beta_1\) can be tested using the same five-step procedures, either
the critical value approach or the p-value approach, that were introduced in Section
8.1 "The Elements of Hypothesis Testing" and Section 8.3 "The Observed
Significance of a Test" of Chapter 8 "Testing Hypotheses". The null hypothesis
always has the form \(H_0 : \beta_1 = B_0\) where \(B_0\) is a number determined from the
statement of the problem. The three forms of the alternative hypothesis, with the
terminology for each case, are:


Form of \(H_a\)              Terminology

\(H_a : \beta_1 < B_0\)      Left-tailed

\(H_a : \beta_1 > B_0\)      Right-tailed

\(H_a : \beta_1 \neq B_0\)   Two-tailed

The value zero for \(B_0\) is of particular importance since in that case the null
hypothesis is \(H_0 : \beta_1 = 0\), which corresponds to the situation in which x is not
useful for predicting y. For if \(\beta_1 = 0\) then the population regression line is
horizontal, so the mean \(E(y)\) is the same for every value of x and we are just as
well off in ignoring x completely and approximating y by its average value. Given
two variables x and y, the burden of proof is that x is useful for predicting y, not
that it is not. Thus the phrase “test whether x is useful for prediction of y,” or words
to that effect, means to perform the test

\[
H_0 : \beta_1 = 0 \quad \text{vs.} \quad H_a : \beta_1 \neq 0
\]

Standardized Test Statistic for Hypothesis Tests
Concerning the Slope \(\beta_1\) of the Population Regression
Line

\[
T = \frac{\hat{\beta}_1 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}}
\]

The test statistic has Student’s t-distribution with \(df = n-2\) degrees of
freedom.

The assumptions listed in Section 10.3 "Modelling Linear Relationships with
Randomness Present" must hold.
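
As a rough illustration of the critical value approach with this test statistic, the sketch below computes T and compares it with a critical value for each of the three forms of the alternative hypothesis. The function names `slope_t_statistic` and `reject_null`, and the `tail` labels, are assumptions made for this example. The critical value is supplied by the caller from a t-table: \(t_\alpha\) for a one-tailed test and \(t_{\alpha/2}\) for a two-tailed test.

```python
import math


def slope_t_statistic(slope_hat, b0, s_eps, ss_xx):
    """Standardized test statistic T for H0: beta_1 = B0."""
    return (slope_hat - b0) / (s_eps / math.sqrt(ss_xx))


def reject_null(t_stat, t_crit, tail):
    """Decide whether T falls in the rejection region.

    t_crit is the positive critical value from a t-table with n - 2 df.
    tail is "left", "right", or "two", matching the form of Ha.
    """
    if tail == "left":
        return t_stat <= -t_crit
    if tail == "right":
        return t_stat >= t_crit
    if tail == "two":
        return abs(t_stat) >= t_crit
    raise ValueError("tail must be 'left', 'right', or 'two'")


# Two-tailed test of H0: beta_1 = 0 with df = 3 and t_{0.01} = 4.541.
t_stat = slope_t_statistic(0.34375, 0.0, 0.50, 51.2)
print(round(t_stat, 3), reject_null(t_stat, 4.541, "two"))
```

The rejection regions are taken as closed intervals, so a statistic exactly equal to the critical value leads to rejection.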
EXAMPLE 8


Test, at the 2% level of significance, whether the variable x is useful for
predicting y based on the information in the five-point data set

x    2    2    6    8    10
y    0    1    2    3    3

Solution:


We will perform the test using the critical value approach.


• Step 1. Since x is useful for prediction of y precisely when the
slope \(\beta_1\) of the population regression line is nonzero, the
relevant test is

\[
H_0 : \beta_1 = 0 \quad \text{vs.} \quad H_a : \beta_1 \neq 0 \quad @ \; \alpha = 0.02
\]

• Step 2. The test statistic is

\[
T = \frac{\hat{\beta}_1}{s_\varepsilon / \sqrt{SS_{xx}}}
\]

and has Student’s t-distribution with \(n - 2 = 5 - 2 = 3\)
degrees of freedom.


• Step 3. From Note 10.18 "Example 2", \(\hat{\beta}_1 = 0.34375\) and
\(SS_{xx} = 51.2\). From Note 10.30 "Example 6", \(s_\varepsilon = 0.50\). The
value of the test statistic is therefore

\[
T = \frac{\hat{\beta}_1 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}} = \frac{0.34375}{0.50 / \sqrt{51.2}} = 4.919
\]

• Step 4. Since the symbol in \(H_a\) is “\(\neq\)” this is a two-tailed test, so there are
two critical values \(\pm t_{\alpha/2} = \pm t_{0.01}\). Reading from the line in Figure
12.3 "Critical Values of t" labeled \(df = 3\), \(t_{0.01} = 4.541\). The
rejection region is \((-\infty, -4.541] \cup [4.541, \infty)\).

• Step 5. As shown in Figure 10.9 the test statistic falls in the
rejection region. The decision is to reject \(H_0\). In the context of
the problem our conclusion is:

The data provide sufficient evidence, at the 2% level of
significance, to conclude that the slope of the population
regression line is nonzero, so that x is useful as a predictor of y.


Figure 10.9 Rejection Region and Test Statistic for Note 10.33 "Example 8"
EXAMPLE 9


A car salesman claims that automobiles between two and six years old of the
make and model discussed in Note 10.19 "Example 3" in Section 10.4 "The
Least Squares Regression Line" lose more than $1,100 in value each year.
Test this claim at the 5% level of significance.

Solution:

We will perform the test using the critical value approach.


• Step 1. In terms of the variables x and y, the salesman’s claim is
that if x is increased by 1 unit (one additional year in age), then y
decreases by more than 1.1 units (more than $1,100). Thus his
assertion is that the slope of the population regression line is
negative, and that it is more negative than -1.1. In symbols,
\(\beta_1 < -1.1\). Since it contains an inequality, this has to be the
alternative hypothesis. The null hypothesis has to be an equality
and have the same number on the right hand side, so the
relevant test is

\[
H_0 : \beta_1 = -1.1 \quad \text{vs.} \quad H_a : \beta_1 < -1.1 \quad @ \; \alpha = 0.05
\]

• Step 2. The test statistic is

\[
T = \frac{\hat{\beta}_1 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}}
\]

and has Student’s t-distribution with 8 degrees of freedom.


• Step 3. From Note 10.19 "Example 3", \(\hat{\beta}_1 = -2.05\) and
\(SS_{xx} = 14\). From Note 10.31 "Example 7",
\(s_\varepsilon = 1.902169814\). The value of the test statistic is
therefore

\[
T = \frac{\hat{\beta}_1 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}} = \frac{-2.05 - (-1.1)}{1.902169814 / \sqrt{14}} = -1.869
\]

• Step 4. Since the symbol in \(H_a\) is “<” this is a left-tailed test, so there is a
single critical value \(-t_\alpha = -t_{0.05}\). Reading from the line in Figure 12.3
"Critical Values of t" labeled \(df = 8\), \(t_{0.05} = 1.860\). The rejection
region is \((-\infty, -1.860]\).


• Step 5. As shown in Figure 10.10 the test statistic falls in the
rejection region. The decision is to reject \(H_0\). In the context of
the problem our conclusion is:


The data provide sufficient evidence, at the 5% level of
significance, to conclude that vehicles of this make and model
and in this age range lose more than $1,100 per year in value, on
average.


Figure 10.10 Rejection Region and Test Statistic for Note 10.34 "Example 9"
KEY TAKEAWAYS


• The parameter \(\beta_1\), the slope of the population regression line, is of
primary interest because it describes the average change in y with
respect to unit increase in x.

• The statistic \(\hat{\beta}_1\), the slope of the least squares regression line, is a point
estimate of \(\beta_1\). Confidence intervals for \(\beta_1\) can be computed using a
formula.

• Hypotheses regarding \(\beta_1\) are tested using the same five-step procedures
introduced in Chapter 8 "Testing Hypotheses".
EXERCISES


BASIC


For the Basic and Application exercises in this section use the computations
that  were  done  for  the  exercises  with  the  same  number  in  Section  10.2  "The 
Linear  Correlation  Coefficient"  and  Section  10.4  "The  Least  Squares 
Regression  Line". 

1.  Construct  the  95%  confidence  interval  for  the  slope  of  the  population 
regression  line  based  on  the  sample  data  set  of  Exercise  1  of  Section  10.2  "The 
Linear  Correlation  Coefficient". 

2.  Construct  the  90%  confidence  interval  for  the  slope  of  the  population 
regression  line  based  on  the  sample  data  set  of  Exercise  2  of  Section  10.2  "The 
Linear  Correlation  Coefficient". 

3.  Construct  the  90%  confidence  interval  for  the  slope  of  the  population 
regression  line  based  on  the  sample  data  set  of  Exercise  3  of  Section  10.2  "The 
Linear  Correlation  Coefficient". 

4. Construct the 99% confidence interval for the slope of the population
regression line based on the sample data set of Exercise 4 of Section 10.2 "The
Linear Correlation Coefficient".

5. For the data in Exercise 5 of Section 10.2 "The Linear Correlation Coefficient"
test, at the 10% level of significance, whether x is useful for predicting y (that
is, whether \(\beta_1 \neq 0\)).

6. For the data in Exercise 6 of Section 10.2 "The Linear Correlation Coefficient"
test, at the 5% level of significance, whether x is useful for predicting y (that is,
whether \(\beta_1 \neq 0\)).

7.  Construct  the  90%  confidence  interval  for  the  slope  of  the  population 
regression  line  based  on  the  sample  data  set  of  Exercise  7  of  Section  10.2  "The 
Linear  Correlation  Coefficient". 

8.  Construct  the  95%  confidence  interval  for  the  slope  of  the  population 
regression  line  based  on  the  sample  data  set  of  Exercise  8  of  Section  10.2  "The 
Linear  Correlation  Coefficient". 

9. For the data in Exercise 9 of Section 10.2 "The Linear Correlation Coefficient"
test, at the 1% level of significance, whether x is useful for predicting y (that is,
whether \(\beta_1 \neq 0\)).

10. For the data in Exercise 10 of Section 10.2 "The Linear Correlation Coefficient"
test, at the 1% level of significance, whether x is useful for predicting y (that is,
whether \(\beta_1 \neq 0\)).


APPLICATIONS 


11.  For  the  data  in  Exercise  11  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
construct  a  90%  confidence  interval  for  the  mean  number  of  new  words 
acquired  per  month  by  children  between  13  and  18  months  of  age. 

12.  For  the  data  in  Exercise  12  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
construct  a  90%  confidence  interval  for  the  mean  increased  braking  distance 
for  each  additional  100  pounds  of  vehicle  weight. 

13.  For  the  data  in  Exercise  13  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
test,  at  the  10%  level  of  significance,  whether  age  is  useful  for  predicting 
resting  heart  rate. 

14.  For  the  data  in  Exercise  14  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
test,  at  the  10%  level  of  significance,  whether  wind  speed  is  useful  for 
predicting  wave  height. 

15.  For  the  situation  described  in  Exercise  15  of  Section  10.2  "The  Linear 
Correlation  Coefficient" 

a.  Construct  the  95%  confidence  interval  for  the  mean  increase  in  revenue 
per  additional  thousand  dollars  spent  on  advertising. 

b.  An  advertising  agency  tells  the  business  owner  that  for  every  additional 
thousand  dollars  spent  on  advertising,  revenue  will  increase  by  over 
$25,000.  Test  this  claim  (which  is  the  alternative  hypothesis)  at  the  5% 
level  of  significance. 

c.  Perform  the  test  of  part  (b)  at  the  10%  level  of  significance. 

d.  Based  on  the  results  in  (b)  and  (c),  how  believable  is  the  ad  agency’s  claim? 
(This  is  a  subjective  judgement.) 

16.  For  the  situation  described  in  Exercise  16  of  Section  10.2  "The  Linear 
Correlation  Coefficient" 

a.  Construct  the  90%  confidence  interval  for  the  mean  increase  in  height  per 
additional  inch  of  length  at  age  two. 

b.  It  is  claimed  that  for  girls  each  additional  inch  of  length  at  age  two  means 
more  than  an  additional  inch  of  height  at  maturity.  Test  this  claim  (which 
is  the  alternative  hypothesis)  at  the  10%  level  of  significance. 
17. For the data in Exercise 17 of Section 10.2 "The Linear Correlation Coefficient"
test,  at  the  10%  level  of  significance,  whether  course  average  before  the  final 
exam  is  useful  for  predicting  the  final  exam  grade. 

18.  For  the  situation  described  in  Exercise  18  of  Section  10.2  "The  Linear 
Correlation  Coefficient",  an  agronomist  claims  that  each  additional  million 
acres  planted  results  in  more  than  750,000  additional  acres  harvested.  Test  this 
claim  at  the  1%  level  of  significance. 

19. For the data in Exercise 19 of Section 10.2 "The Linear Correlation Coefficient"
test, at the 1/10th of 1% level of significance, whether, ignoring all other facts
such as age and body mass, the amount of the medication consumed is a useful
predictor of blood concentration of the active ingredient.

20.  For  the  data  in  Exercise  20  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
test,  at  the  1%  level  of  significance,  whether  for  each  additional  inch  of  girth 
the  age  of  the  tree  increases  by  at  least  two  and  one-half  years. 

21.  For  the  data  in  Exercise  21  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Construct  the  95%  confidence  interval  for  the  mean  increase  in  strength  at 
28  days  for  each  additional  hundred  psi  increase  in  strength  at  3  days. 

b. Test, at the 1/10th of 1% level of significance, whether the 3-day strength
is useful for predicting 28-day strength.

22.  For  the  situation  described  in  Exercise  22  of  Section  10.2  "The  Linear 
Correlation  Coefficient" 

a.  Construct  the  99%  confidence  interval  for  the  mean  decrease  in  energy 
demand  for  each  one-degree  drop  in  temperature. 

b.  An  engineer  with  the  power  company  believes  that  for  each  one-degree 
increase  in  temperature,  daily  energy  demand  will  decrease  by  more  than 
3.6  million  watt-hours.  Test  this  claim  at  the  1%  level  of  significance. 


LARGE  DATA  SET  EXERCISES 


23.  Large  Data  Set  1  lists  the  SAT  scores  and  GPAs  of  1,000  students. 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal.xls 

a. Compute the 90% confidence interval for the slope \(\beta_1\) of the population
regression line with SAT score as the independent variable (x) and GPA as
the dependent variable (y).

b. Test, at the 10% level of significance, the hypothesis that the slope of the
population regression line is greater than 0.001, against the null
hypothesis that it is exactly 0.001.

24.  Large  Data  Set  12  lists  the  golf  scores  on  one  round  of  golf  for  75  golfers  first 
using  their  own  original  clubs,  then  using  clubs  of  a  new,  experimental  design 
(after  two  months  of  familiarization  with  the  new  clubs). 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal2.xls 

a. Compute the 95% confidence interval for the slope \(\beta_1\) of the population
regression line with scores using the original clubs as the independent
variable (x) and scores using the new clubs as the dependent variable (y).

b.  Test,  at  the  10%  level  of  significance,  the  hypothesis  that  the  slope  of  the 
population  regression  line  is  different  from  1,  against  the  null  hypothesis 
that  it  is  exactly  1. 

25.  Large  Data  Set  13  records  the  number  of  bidders  and  sales  price  of  a  particular 
type  of  antique  grandfather  clock  at  60  auctions. 

http://www.gone.2012books.lardbucket.org/sites/all/files/datal3.xls 

a. Compute the 95% confidence interval for the slope \(\beta_1\) of the population
regression line with the number of bidders present at the auction as the
independent variable (x) and sales price as the dependent variable (y).

b.  Test,  at  the  10%  level  of  significance,  the  hypothesis  that  the  average  sales 
price  increases  by  more  than  $90  for  each  additional  bidder  at  an  auction, 
against  the  default  that  it  increases  by  exactly  $90. 
ANSWERS


1. 0.743 ± 0.578

3. -0.610 ± 0.633

5. \(T = 1.732\), \(\pm t_{0.05} = \pm 2.353\), do not reject \(H_0\)

7. 0.6 ± 0.451

9. \(T = -4.481\), \(\pm t_{0.005} = \pm 3.355\), reject \(H_0\)

11. 4.8 ± 1.7 words

13. \(T = 2.843\), \(\pm t_{0.05} = \pm 1.860\), reject \(H_0\)

15. a. 42.024 ± 28.011 thousand dollars,

b. \(T = 1.487\), \(t_{0.05} = 1.943\), do not reject \(H_0\);

c. \(t_{0.10} = 1.440\), reject \(H_0\)

17. \(T = 4.096\), \(\pm t_{0.05} = \pm 1.771\), reject \(H_0\)

19. \(T = 25.524\), \(\pm t_{0.0005} = \pm 3.505\), reject \(H_0\)

21. a. 2.550 ± 0.127 hundred psi,

b. \(T = 41.072\), \(\pm t_{0.005} = \pm 3.674\), reject \(H_0\)

23. a. (0.0014, 0.0018)

b. \(H_0 : \beta_1 = 0.001\) vs. \(H_a : \beta_1 > 0.001\). Test Statistic:
\(Z = 6.1625\). Rejection Region: \([1.28, +\infty)\). Decision: Reject \(H_0\).

25. a. (101.789, 131.4435)

b. \(H_0 : \beta_1 = 90\) vs. \(H_a : \beta_1 > 90\). Test Statistic: \(T = 3.5938\),
d.f. = 58. Rejection Region: \([1.296, +\infty)\). Decision: Reject \(H_0\).
10.6 The Coefficient of Determination


LEARNING  OBJECTIVE 

1.  To  learn  what  the  coefficient  of  determination  is,  how  to  compute  it,  and 
what  it  tells  us  about  the  relationship  between  two  variables  x  and  y. 


If the scatter diagram of a set of (x, y) pairs shows neither an upward nor a downward
trend, then the horizontal line \(\hat{y} = \bar{y}\) fits it well, as illustrated in Figure 10.11. The
lack of any upward or downward trend means that when an element of the
population is selected at random, knowing the value of the measurement x for that
element is not helpful in predicting the value of the measurement y.


Figure  10.11 
If the scatter diagram shows a linear trend upward or downward then it is useful to
compute the least squares regression line \(\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0\) and use it in predicting y.
Figure 10.12 "Same Scatter Diagram with Two Approximating Lines" illustrates this.
In each panel we have plotted the height and weight data of Section 10.1 "Linear
Relationships Between Variables". This is the same scatter plot as Figure 10.2 "Plot
of Height and Weight Pairs", with the average value line \(\hat{y} = \bar{y}\) superimposed on it
in the left panel and the least squares regression line superimposed on it in the right
panel. The errors are indicated graphically by the vertical line segments.


Figure  10.12  Same  Scatter  Diagram  with  Two  Approximating  Lines 


(a):  Set  I  (b):  Set  II 


The sum of the squared errors computed for the regression line, SSE, is smaller
than the sum of the squared errors computed for any other line. In particular it is
less than the sum of the squared errors computed using the line \(\hat{y} = \bar{y}\), which sum
is actually the number \(SS_{yy}\) that we have seen several times already. A measure of
how useful it is to use the regression equation for prediction of y is how much
smaller SSE is than \(SS_{yy}\). In particular, the proportion of the sum of the squared
errors for the line \(\hat{y} = \bar{y}\) that is eliminated by going over to the least squares
regression line is

\[
\frac{SS_{yy} - SSE}{SS_{yy}} = \frac{SS_{yy}}{SS_{yy}} - \frac{SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}
\]

We can think of \(SSE / SS_{yy}\) as the proportion of the variability in y that cannot be
accounted for by the linear relationship between x and y, since it is still there even
when x is taken into account in the best way possible (using the least squares
regression line; remember that SSE is the smallest the sum of the squared errors
can be for any line). Seen in this light, the coefficient of determination, the
complementary proportion of the variability in y, is the proportion of the
variability in all the y measurements that is accounted for by the linear relationship
between x and y.
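
The comparison between the two lines can be checked numerically. The sketch below is a minimal illustration that reuses the five-point data set from the examples of the previous section (an illustrative choice, not the height and weight data of Figure 10.12); the helper name `sum_squared_errors` and the variable names are assumptions introduced here. It computes the sum of the squared errors for the horizontal line through the mean of y and for the least squares regression line, then the proportion eliminated.

```python
def sum_squared_errors(x, y, slope, intercept):
    """Sum of squared vertical distances from the points to a line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


x = [2, 2, 6, 8, 10]
y = [0, 1, 2, 3, 3]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope_hat = ss_xy / ss_xx
intercept_hat = y_bar - slope_hat * x_bar

# Errors for the horizontal line through y-bar: this sum equals SS_yy.
ss_yy = sum_squared_errors(x, y, 0.0, y_bar)
# Errors for the least squares regression line: this sum is SSE.
sse = sum_squared_errors(x, y, slope_hat, intercept_hat)

proportion_eliminated = (ss_yy - sse) / ss_yy
print(round(ss_yy, 4), round(sse, 4), round(proportion_eliminated, 4))
```

For this data set the two sums are 6.8 and 0.75, so roughly 89% of the squared error of the horizontal line is eliminated by switching to the regression line.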


In the context of linear regression the coefficient of determination is always the
square of the correlation coefficient r discussed in Section 10.2 "The Linear
Correlation Coefficient". Thus the coefficient of determination is denoted \(r^2\), and we
have two additional formulas for computing it.


Definition


The coefficient of determination[8] of a collection of (x, y) pairs is the number \(r^2\)
computed by any of the following three expressions:

\[
r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{SS_{xy}^2}{SS_{xx} \, SS_{yy}} = \hat{\beta}_1 \, \frac{SS_{xy}}{SS_{yy}}
\]

It measures the proportion of the variability in y that is accounted for by the linear
relationship between x and y.


If the correlation coefficient r is already known then the coefficient of
determination can be computed simply by squaring r, as the notation indicates,
\(r^2 = (r)^2\).


8. A number that measures the proportion of the variability in y that is explained by x.
EXAMPLE 10


The value of used vehicles of the make and model discussed in Note 10.19
"Example 3" in Section 10.4 "The Least Squares Regression Line" varies
widely. The most expensive automobile in the sample in Table 10.3 "Data on
Age and Value of Used Automobiles of a Specific Make and Model" has value
$30,500, which is nearly half again as much as the least expensive one, which
is worth $20,400. Find the proportion of the variability in value that is
accounted for by the linear relationship between age and value.

Solution:

The proportion of the variability in value y that is accounted for by the
linear relationship between it and age x is given by the coefficient of
determination, \(r^2\). Since the correlation coefficient r was already computed
in Note 10.19 "Example 3" as \(r = -0.819\), \(r^2 = (-0.819)^2 = 0.671\).
About 67% of the variability in the value of this vehicle can be explained by
its age.
EXAMPLE 11


Use each of the three formulas for the coefficient of determination to
compute its value for the example of ages and values of vehicles.

Solution:


In Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression
Line" we computed the exact values

\[
SS_{xx} = 14 \qquad SS_{xy} = -28.7 \qquad SS_{yy} = 87.781 \qquad \hat{\beta}_1 = -2.05
\]

In Note 10.24 "Example 5" in Section 10.4 "The Least Squares Regression
Line" we computed the exact value

\[
SSE = 28.946
\]

Inserting these values into the formulas in the definition, one after the
other, gives

\[
r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{87.781 - 28.946}{87.781} = 0.6702475479
\]

\[
r^2 = \frac{SS_{xy}^2}{SS_{xx} \, SS_{yy}} = \frac{(-28.7)^2}{(14)(87.781)} = 0.6702475479
\]

\[
r^2 = \hat{\beta}_1 \, \frac{SS_{xy}}{SS_{yy}} = -2.05 \cdot \frac{-28.7}{87.781} = 0.6702475479
\]

which rounds to 0.670. The discrepancy between the value here and in the
previous example is because a rounded value of r from Note 10.19 "Example
3" was used there. The actual value of r before rounding is -0.8186864772,
which when squared gives the value for \(r^2\) obtained here.
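
The agreement of the three formulas can be verified with a few lines of code. This is a minimal sketch using the summary values quoted in the example; the variable names are assumptions chosen for illustration.

```python
# Summary values quoted in the example for the vehicle age and value data.
ss_xx = 14
ss_xy = -28.7
ss_yy = 87.781
slope_hat = -2.05
sse = 28.946

# The three expressions for the coefficient of determination.
r2_from_sse = (ss_yy - sse) / ss_yy
r2_from_sums = ss_xy ** 2 / (ss_xx * ss_yy)
r2_from_slope = slope_hat * ss_xy / ss_yy

for value in (r2_from_sse, r2_from_sums, r2_from_slope):
    print(round(value, 6))
```

All three expressions agree to the displayed precision, so in practice the choice among them depends only on which quantities are already at hand.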


The coefficient of determination \(r^2\) can always be computed by squaring the
correlation coefficient r if it is known. Any one of the defining formulas can also be
used. Typically one would make the choice based on which quantities have already
been computed. What should be avoided is trying to compute r by taking the square
root of \(r^2\), if it is already known, since it is easy to make a sign error this way. To see
what can go wrong, suppose \(r^2 = 0.64\). Taking the square root of a positive
number with any calculating device will always return a positive result. The square
root of 0.64 is 0.8. However, the actual value of r might be the negative number -0.8.
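
If r must nevertheless be recovered from the coefficient of determination, the sign has to be taken from another quantity. The sketch below relies on the fact that r always has the same sign as the slope of the least squares regression line (both take their sign from \(SS_{xy}\)). The function name `correlation_from_r_squared` is an assumption introduced for illustration.

```python
import math


def correlation_from_r_squared(r_squared, slope_hat):
    """Recover r from r^2, taking the sign from the regression slope.

    math.sqrt alone always returns the nonnegative root, so the sign
    is attached separately with math.copysign.
    """
    return math.copysign(math.sqrt(r_squared), slope_hat)


# A plain square root loses the sign; the slope restores it.
print(math.sqrt(0.64))
print(correlation_from_r_squared(0.64, -2.05))
```

The first line printed is the positive root, while the second carries the negative sign of the slope passed in.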


KEY  TAKEAWAYS 


• The coefficient of determination \(r^2\) estimates the proportion of the
variability in the variable y that is explained by the linear relationship
between y and the variable x.

• There are several formulas for computing \(r^2\). The choice of which one to
use can be based on which quantities have already been computed so
far.
EXERCISES


BASIC


For the Basic and Application exercises in this section use the computations
that were done for the exercises with the same number in Section 10.2 "The
Linear Correlation Coefficient", Section 10.4 "The Least Squares Regression
Line", and Section 10.5 "Statistical Inferences About \(\beta_1\)".

1. For the sample data set of Exercise 1 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = \hat{\beta}_1 SS_{xy} / SS_{yy}\). Confirm your answer by squaring r as computed in
that exercise.

2. For the sample data set of Exercise 2 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = \hat{\beta}_1 SS_{xy} / SS_{yy}\). Confirm your answer by squaring r as computed in
that exercise.

3. For the sample data set of Exercise 3 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = \hat{\beta}_1 SS_{xy} / SS_{yy}\). Confirm your answer by squaring r as computed in
that exercise.

4. For the sample data set of Exercise 4 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = \hat{\beta}_1 SS_{xy} / SS_{yy}\). Confirm your answer by squaring r as computed in
that exercise.

5. For the sample data set of Exercise 5 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = \hat{\beta}_1 SS_{xy} / SS_{yy}\). Confirm your answer by squaring r as computed in
that exercise.

6. For the sample data set of Exercise 6 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = \hat{\beta}_1 SS_{xy} / SS_{yy}\). Confirm your answer by squaring r as computed in
that exercise.

7. For the sample data set of Exercise 7 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = (SS_{yy} - SSE) / SS_{yy}\). Confirm your answer by squaring r as
computed in that exercise.

8. For the sample data set of Exercise 8 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = (SS_{yy} - SSE) / SS_{yy}\). Confirm your answer by squaring r as
computed in that exercise.

9. For the sample data set of Exercise 9 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = (SS_{yy} - SSE) / SS_{yy}\). Confirm your answer by squaring r as
computed in that exercise.

10. For the sample data set of Exercise 10 of Section 10.2 "The Linear Correlation
Coefficient" find the coefficient of determination using the formula
\(r^2 = (SS_{yy} - SSE) / SS_{yy}\). Confirm your answer by squaring r as
computed in that exercise.


APPLICATIONS 


11.  For  the  data  in  Exercise  11  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
compute  the  coefficient  of  determination  and  interpret  its  value  in  the  context 
of  age  and  vocabulary. 

12.  For  the  data  in  Exercise  12  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
compute  the  coefficient  of  determination  and  interpret  its  value  in  the  context 
of  vehicle  weight  and  braking  distance. 

13.  For  the  data  in  Exercise  13  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
compute  the  coefficient  of  determination  and  interpret  its  value  in  the  context 
of  age  and  resting  heart  rate.  In  the  age  range  of  the  data,  does  age  seem  to  be 
a  very  important  factor  with  regard  to  heart  rate? 

14.  For  the  data  in  Exercise  14  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
compute  the  coefficient  of  determination  and  interpret  its  value  in  the  context 
of  wind  speed  and  wave  height.  Does  wind  speed  seem  to  be  a  very  important 
factor  with  regard  to  wave  height? 

15.  For  the  data  in  Exercise  15  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
find  the  proportion  of  the  variability  in  revenue  that  is  explained  by  level  of 
advertising. 
16. For the data in Exercise 16 of Section 10.2 "The Linear Correlation Coefficient"
find  the  proportion  of  the  variability  in  adult  height  that  is  explained  by  the 
variation  in  length  at  age  two. 

17.  For  the  data  in  Exercise  17  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
compute  the  coefficient  of  determination  and  interpret  its  value  in  the  context 
of  course  average  before  the  final  exam  and  score  on  the  final  exam. 

18.  For  the  data  in  Exercise  18  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
compute  the  coefficient  of  determination  and  interpret  its  value  in  the  context 
of  acres  planted  and  acres  harvested. 

19.  For  the  data  in  Exercise  19  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
compute  the  coefficient  of  determination  and  interpret  its  value  in  the  context 
of  the  amount  of  the  medication  consumed  and  blood  concentration  of  the 
active  ingredient. 

20.  For  the  data  in  Exercise  20  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
compute  the  coefficient  of  determination  and  interpret  its  value  in  the  context 
of  tree  size  and  age. 

21.  For  the  data  in  Exercise  21  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
find  the  proportion  of  the  variability  in  28-day  strength  of  concrete  that  is 
accounted  for  by  variation  in  3-day  strength. 

22.  For  the  data  in  Exercise  22  of  Section  10.2  "The  Linear  Correlation  Coefficient" 
find  the  proportion  of  the  variability  in  energy  demand  that  is  accounted  for 
by  variation  in  average  temperature. 


LARGE  DATA  SET  EXERCISES 


23.  Large  Data  Set  1  lists  the  SAT  scores  and  GPAs  of  1,000  students.  Compute  the 
coefficient  of  determination  and  interpret  its  value  in  the  context  of  SAT 
scores  and  GPAs. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data1.xls

24.  Large  Data  Set  12  lists  the  golf  scores  on  one  round  of  golf  for  75  golfers  first 
using  their  own  original  clubs,  then  using  clubs  of  a  new,  experimental  design 
(after  two  months  of  familiarization  with  the  new  clubs).  Compute  the 
coefficient  of  determination  and  interpret  its  value  in  the  context  of  golf 
scores  with  the  two  kinds  of  golf  clubs. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data12.xls

25. Large Data Set 13 records the number of bidders and sales price of a particular
type  of  antique  grandfather  clock  at  60  auctions.  Compute  the  coefficient  of 
determination  and  interpret  its  value  in  the  context  of  the  number  of  bidders 
at  an  auction  and  the  price  of  this  type  of  antique  grandfather  clock. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data13.xls


ANSWERS 


1.  0.848 
3.  0.631 
5.  0.5 
7.  0.766 
9.  0.715 

11.  0.898;  about  90%  of  the  variability  in  vocabulary  is  explained  by  age 

13.  0.503;  about  50%  of  the  variability  in  heart  rate  is  explained  by  age.  Age  is  a 
significant  but  not  dominant  factor  in  explaining  heart  rate. 

15. The proportion is \(r^2 = 0.692\).

17.  0.563;  about  56%  of  the  variability  in  final  exam  scores  is  explained  by  course 
average  before  the  final  exam 

19.  0.931;  about  93%  of  the  variability  in  the  blood  concentration  of  the  active 
ingredient  is  explained  by  the  amount  of  the  medication  consumed 

21. The proportion is \(r^2 = 0.984\).

23. \(r^2 = 21.17\%\).

25. \(r^2 = 81.04\%\).


10.7 Estimation and Prediction


LEARNING  OBJECTIVES 

1.  To  learn  the  distinction  between  estimation  and  prediction. 

2.  To  learn  the  distinction  between  a  confidence  interval  and  a  prediction 
interval. 

3.  To  learn  how  to  implement  formulas  for  computing  confidence  intervals 
and  prediction  intervals. 


Consider  the  following  pairs  of  problems,  in  the  context  of  Note  10.19  "Example  3" 
in  Section  10.4  "The  Least  Squares  Regression  Line",  the  automobile  age  and  value 
example. 


1.  a. Estimate the average value of all four-year-old automobiles of this
       make and model.

    b. Construct a 95% confidence interval for the average value of all
       four-year-old automobiles of this make and model.

2.  a. Shylock intends to buy a four-year-old automobile of this make
       and model next week. Predict the value of the first such
       automobile that he encounters.

    b. Construct a 95% confidence interval for the value of the first such
       automobile that he encounters.


The method of solution and the answer to the first question in each pair, (1a) and
(2a), are the same. When we set x equal to 4 in the least squares regression equation
\(\hat{y} = -2.05x + 32.83\) that was computed in part (c) of Note 10.19 "Example 3" in
Section 10.4 "The Least Squares Regression Line", the number returned,

\[ \hat{y} = -2.05(4) + 32.83 = 24.63 \]

which corresponds to value $24,630, is an estimate of precisely the number sought
in question (1a): the mean E(y) of all y-values when x = 4. Since nothing is known
about the first four-year-old automobile of this make and model that Shylock will
encounter, our best guess as to its value is the mean value E(y) of all such
automobiles, the number 24.63 or $24,630, computed in the same way.


The answers to the second part of each question differ. In question (1b) we are
trying to estimate a population parameter: the mean of all the y-values in the
sub-population picked out by the value x = 4, that is, the average value of all
four-year-old automobiles. In question (2b), however, we are not trying to capture a
fixed parameter, but the value of the random variable y in one trial of an
experiment: examine the first four-year-old car Shylock encounters. In the first
case we seek to construct a confidence interval in the same sense that we have done
before. In the second case the situation is different, and the interval constructed
has a different name, prediction interval. In the second case we are trying to
“predict” the value that a random variable will take.


100(1 − α)% Confidence Interval for the Mean Value
of y at x = x_p


\[ \hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \]


where


a. \(x_p\) is a particular value of x that lies in the range of x-values in the
   sample data set used to construct the least squares regression line;

b. \(\hat{y}_p\) is the numerical value obtained when the least squares
   regression equation is evaluated at \(x = x_p\); and

c. the number of degrees of freedom for \(t_{\alpha/2}\) is df = n − 2.


The  assumptions  listed  in  Section  10.3  "Modelling  Linear  Relationships  with 
Randomness  Present"  must  hold. 


The  formula  for  the  prediction  interval  is  identical  except  for  the  presence  of  the 
number  1  underneath  the  square  root  sign.  This  means  that  the  prediction  interval 
is  always  wider  than  the  confidence  interval  at  the  same  confidence  level  and  value 
of  x.  In  practice  the  presence  of  the  number  1  tends  to  make  it  much  wider. 

100(1 − α)% Prediction Interval for an Individual
New Value of y at x = x_p


\[ \hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \]


where

a. \(x_p\) is a particular value of x that lies in the range of x-values in the
   data set used to construct the least squares regression line;

b. \(\hat{y}_p\) is the numerical value obtained when the least squares
   regression equation is evaluated at \(x = x_p\); and

c. the number of degrees of freedom for \(t_{\alpha/2}\) is df = n − 2.


The  assumptions  listed  in  Section  10.3  "Modelling  Linear  Relationships  with 
Randomness  Present"  must  hold. 

EXAMPLE 12


Using  the  sample  data  of  Note  10.19  "Example  3"  in  Section  10.4  "The  Least 
Squares  Regression  Line",  recorded  in  Table  10.3  "Data  on  Age  and  Value  of 
Used  Automobiles  of  a  Specific  Make  and  Model",  construct  a  95% 
confidence  interval  for  the  average  value  of  all  three-and-one-half-year-old 
automobiles  of  this  make  and  model. 

Solution: 


Solving this problem is merely a matter of finding the values of \(\hat{y}_p\), \(\alpha\) and
\(t_{\alpha/2}\), \(s_\varepsilon\), \(\bar{x}\), and \(SS_{xx}\) and inserting them into the confidence interval
formula given just above. Most of these quantities are already known. From
Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line",
\(SS_{xx} = 14\) and \(\bar{x} = 4\). From Note 10.31 "Example 7" in Section 10.5
"Statistical Inferences About \(\beta_1\)", \(s_\varepsilon = 1.902169814\).

From the statement of the problem \(x_p = 3.5\), the value of x of interest.
The value of \(\hat{y}_p\) is the number given by the regression equation, which by
Note 10.19 "Example 3" is \(\hat{y} = -2.05x + 32.83\), when \(x = x_p\), that is,
when x = 3.5. Thus here \(\hat{y}_p = -2.05(3.5) + 32.83 = 25.655\).

Lastly, confidence level 95% means that \(\alpha = 1 - 0.95 = 0.05\) so
\(\alpha/2 = 0.025\). Since the sample size is n = 10, there are n − 2 = 8
degrees of freedom. By Figure 12.3 "Critical Values of t", \(t_{0.025} = 2.306\).
Thus

\[
\begin{aligned}
\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
  &= 25.655 \pm (2.306)(1.902169814)\sqrt{\frac{1}{10} + \frac{(3.5 - 4)^2}{14}} \\
  &= 25.655 \pm 4.386403591\sqrt{0.1178571429} \\
  &= 25.655 \pm 1.506
\end{aligned}
\]

which gives the interval (24.149, 27.161).


We  are  95%  confident  that  the  average  value  of  all  three-and-one-half-year- 
old  vehicles  of  this  make  and  model  is  between  $24,149  and  $27,161. 
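
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch and not part of the original example: the function name `mean_response_ci` and its parameter names are our own, and the critical value 2.306 is copied from the table of critical values of t rather than computed.

```python
import math


def mean_response_ci(y_hat_p, t_crit, s_eps, n, x_p, x_bar, ss_xx):
    """Return (low, high) for the mean value of y at x = x_p."""
    half_width = t_crit * s_eps * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)
    return y_hat_p - half_width, y_hat_p + half_width


# Values quoted in the example; 2.306 is t_{0.025} with 8 degrees of freedom,
# taken from the table rather than computed here.
y_hat_p = -2.05 * 3.5 + 32.83
low, high = mean_response_ci(
    y_hat_p, t_crit=2.306, s_eps=1.902169814, n=10, x_p=3.5, x_bar=4, ss_xx=14
)
print(f"({low:.3f}, {high:.3f})")  # (24.149, 27.161)
```

Passing the critical value in as an argument keeps the sketch free of any statistics library; with a different confidence level only `t_crit` changes.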

EXAMPLE 13


Using  the  sample  data  of  Note  10.19  "Example  3"  in  Section  10.4  "The  Least 
Squares  Regression  Line",  recorded  in  Table  10.3  "Data  on  Age  and  Value  of 
Used  Automobiles  of  a  Specific  Make  and  Model",  construct  a  95%  prediction 
interval  for  the  predicted  value  of  a  randomly  selected  three-and-one-half- 
year-old  automobile  of  this  make  and  model. 

Solution: 

The  computations  for  this  example  are  identical  to  those  of  the  previous 
example,  except  that  now  there  is  the  extra  number  1  beneath  the  square 
root  sign.  Since  we  were  careful  to  record  the  intermediate  results  of  that 
computation,  we  have  immediately  that  the  95%  prediction  interval  is 


\[
\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
  = 25.655 \pm 4.386403591\sqrt{1.1178571429}
  = 25.655 \pm 4.638
\]

which gives the interval (21.017, 30.293).

We  are  95%  confident  that  the  value  of  a  randomly  selected  three-and-one- 
half-year-old  vehicle  of  this  make  and  model  is  between  $21,017  and  $30,293. 

Note what an enormous difference the presence of the extra number 1 under
the square root sign made. The margin of error grows from 1.506 to 4.638, so
the prediction interval is about three times as wide as the confidence
interval at the same level of confidence.
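
To see the effect of the extra 1 directly, the two margins of error can be computed side by side. The following is a minimal sketch under the same assumptions as the example (all inputs are the values quoted there); the helper `half_width` and its `prediction` flag are hypothetical names introduced only for illustration.

```python
import math


def half_width(t_crit, s_eps, n, x_p, x_bar, ss_xx, prediction=False):
    """Margin of error at x = x_p.

    With prediction=True the extra 1 is added under the square root,
    which turns the confidence interval into a prediction interval.
    """
    extra = 1.0 if prediction else 0.0
    return t_crit * s_eps * math.sqrt(extra + 1 / n + (x_p - x_bar) ** 2 / ss_xx)


# Inputs quoted in the worked examples (t_crit copied from the t table).
example = dict(t_crit=2.306, s_eps=1.902169814, n=10, x_p=3.5, x_bar=4, ss_xx=14)
y_hat_p = -2.05 * 3.5 + 32.83

ci_half = half_width(**example)
pi_half = half_width(**example, prediction=True)
width_ratio = pi_half / ci_half

print(f"confidence interval: {y_hat_p:.3f} +/- {ci_half:.3f}")
print(f"prediction interval: {y_hat_p:.3f} +/- {pi_half:.3f}")
print(f"prediction width / confidence width = {width_ratio:.2f}")
```

Because the two formulas differ only in one term, a single function with a flag makes it plain that the prediction interval can never be narrower than the confidence interval.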


KEY  TAKEAWAYS 


•  A confidence interval is used to estimate the mean value of y in the
   sub-population determined by the condition that x have some specific value
   \(x_p\).

•  The  prediction  interval  is  used  to  predict  the  value  that  the  random 
variable  y  will  take  when  x  has  some  specific  value  xp. 


For the Basic and Application exercises in this section use the computations
that  were  done  for  the  exercises  with  the  same  number  in  previous  sections. 

1.  For  the  sample  data  set  of  Exercise  1  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a. Give a point estimate for the mean value of y in the sub-population
determined  by  the  condition  x  =  4. 

b.  Construct  the  90%  confidence  interval  for  that  mean  value. 

2.  For  the  sample  data  set  of  Exercise  2  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a.  Give  a  point  estimate  for  the  mean  value  of  y  in  the  sub-population 
determined  by  the  condition  x  =  4. 

b.  Construct  the  90%  confidence  interval  for  that  mean  value. 

3.  For  the  sample  data  set  of  Exercise  3  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a. Give a point estimate for the mean value of y in the sub-population
determined  by  the  condition  x  =  7. 

b.  Construct  the  95%  confidence  interval  for  that  mean  value. 

4.  For  the  sample  data  set  of  Exercise  4  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a.  Give  a  point  estimate  for  the  mean  value  of  y  in  the  sub-population 
determined  by  the  condition  x  =  2. 

b.  Construct  the  80%  confidence  interval  for  that  mean  value. 

5.  For  the  sample  data  set  of  Exercise  5  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a.  Give  a  point  estimate  for  the  mean  value  of  y  in  the  sub-population 
determined  by  the  condition  x  =  1. 

b.  Construct  the  80%  confidence  interval  for  that  mean  value. 

6.  For  the  sample  data  set  of  Exercise  6  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a. Give a point estimate for the mean value of y in the sub-population
determined  by  the  condition  x  =  5. 

b.  Construct  the  95%  confidence  interval  for  that  mean  value. 

7.  For  the  sample  data  set  of  Exercise  7  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a. Give a point estimate for the mean value of y in the sub-population
determined  by  the  condition  x  =  6. 

b.  Construct  the  99%  confidence  interval  for  that  mean  value. 

c.  Is  it  valid  to  make  the  same  estimates  for  x  =  12?  Explain. 

8.  For  the  sample  data  set  of  Exercise  8  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a.  Give  a  point  estimate  for  the  mean  value  of  y  in  the  sub-population 
determined  by  the  condition  x  =  12. 

b.  Construct  the  80%  confidence  interval  for  that  mean  value. 

c.  Is  it  valid  to  make  the  same  estimates  for  x  =  0?  Explain. 

9.  For  the  sample  data  set  of  Exercise  9  of  Section  10.2  "The  Linear  Correlation 
Coefficient" 

a.  Give  a  point  estimate  for  the  mean  value  of  y  in  the  sub-population 
determined  by  the  condition  x  =  0. 

b.  Construct  the  90%  confidence  interval  for  that  mean  value. 

c. Is it valid to make the same estimates for x = −1? Explain.

10. For the sample data set of Exercise 10 of Section 10.2 "The Linear Correlation
Coefficient" 

a. Give a point estimate for the mean value of y in the sub-population
determined  by  the  condition  x  =  8. 

b.  Construct  the  95%  confidence  interval  for  that  mean  value. 

c.  Is  it  valid  to  make  the  same  estimates  for  x  =  0?  Explain. 


APPLICATIONS 


11.  For  the  data  in  Exercise  11  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Give  a  point  estimate  for  the  average  number  of  words  in  the  vocabulary  of 
18-month-old  children. 

b.  Construct  the  95%  confidence  interval  for  that  mean  value. 

c.  Is  it  valid  to  make  the  same  estimates  for  two-year-olds?  Explain. 

12.  For  the  data  in  Exercise  12  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a. Give a point estimate for the average braking distance of automobiles that
weigh  3,250  pounds. 

b.  Construct  the  80%  confidence  interval  for  that  mean  value. 

c.  Is  it  valid  to  make  the  same  estimates  for  5,000-pound  automobiles? 
Explain. 

13.  For  the  data  in  Exercise  13  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Give  a  point  estimate  for  the  resting  heart  rate  of  a  man  who  is  35  years 
old. 

b.  One  of  the  men  in  the  sample  is  35  years  old,  but  his  resting  heart  rate  is 
not  what  you  computed  in  part  (a).  Explain  why  this  is  not  a  contradiction. 

c.  Construct  the  90%  confidence  interval  for  the  mean  resting  heart  rate  of 
all  35-year-old  men. 

14.  For  the  data  in  Exercise  14  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Give  a  point  estimate  for  the  wave  height  when  the  wind  speed  is  13  miles 
per  hour. 

b.  One  of  the  wind  speeds  in  the  sample  is  13  miles  per  hour,  but  the  height  of 
waves  that  day  is  not  what  you  computed  in  part  (a).  Explain  why  this  is 
not  a  contradiction. 

c.  Construct  the  90%  confidence  interval  for  the  mean  wave  height  on  days 
when  the  wind  speed  is  13  miles  per  hour. 

15.  For  the  data  in  Exercise  15  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  The  business  owner  intends  to  spend  $2,500  on  advertising  next  year.  Give 
an  estimate  of  next  year’s  revenue  based  on  this  fact. 

b.  Construct  the  90%  prediction  interval  for  next  year’s  revenue,  based  on  the 
intent  to  spend  $2,500  on  advertising. 

16.  For  the  data  in  Exercise  16  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  A  two-year-old  girl  is  32.3  inches  long.  Predict  her  adult  height. 

b.  Construct  the  95%  prediction  interval  for  the  girl’s  adult  height. 

17.  For  the  data  in  Exercise  17  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Lodovico  has  a  78.6  average  in  his  physics  class  just  before  the  final.  Give  a 
point  estimate  of  what  his  final  exam  grade  will  be. 

b.  Explain  whether  an  interval  estimate  for  this  problem  is  a  confidence 
interval  or  a  prediction  interval. 

c.  Based  on  your  answer  to  (b),  construct  an  interval  estimate  for  Lodovico’s 
final  exam  grade  at  the  90%  level  of  confidence. 

18.  For  the  data  in  Exercise  18  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a. This year 86.2 million acres of corn were planted. Give a point estimate of
the  number  of  acres  that  will  be  harvested  this  year. 

b.  Explain  whether  an  interval  estimate  for  this  problem  is  a  confidence 
interval  or  a  prediction  interval. 

c.  Based  on  your  answer  to  (b),  construct  an  interval  estimate  for  the  number 
of  acres  that  will  be  harvested  this  year,  at  the  99%  level  of  confidence. 

19.  For  the  data  in  Exercise  19  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Give  a  point  estimate  for  the  blood  concentration  of  the  active  ingredient 
of  this  medication  in  a  man  who  has  consumed  1.5  ounces  of  the 
medication  just  recently. 

b.  Gratiano  just  consumed  1.5  ounces  of  this  medication  30  minutes  ago. 
Construct  a  95%  prediction  interval  for  the  concentration  of  the  active 
ingredient  in  his  blood  right  now. 

20.  For  the  data  in  Exercise  20  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  You  measure  the  girth  of  a  free-standing  oak  tree  five  feet  off  the  ground 
and  obtain  the  value  127  inches.  How  old  do  you  estimate  the  tree  to  be? 

b.  Construct  a  90%  prediction  interval  for  the  age  of  this  tree. 

21.  For  the  data  in  Exercise  21  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  A  test  cylinder  of  concrete  three  days  old  fails  at  1,750  psi.  Predict  what  the 
28-day  strength  of  the  concrete  will  be. 

b.  Construct  a  99%  prediction  interval  for  the  28-day  strength  of  this 
concrete. 

c.  Based  on  your  answer  to  (b),  what  would  be  the  minimum  28-day  strength 
you  could  expect  this  concrete  to  exhibit? 

22.  For  the  data  in  Exercise  22  of  Section  10.2  "The  Linear  Correlation  Coefficient" 

a.  Tomorrow’s  average  temperature  is  forecast  to  be  53  degrees.  Estimate  the 
energy  demand  tomorrow. 

b.  Construct  a  99%  prediction  interval  for  the  energy  demand  tomorrow. 

c.  Based  on  your  answer  to  (b),  what  would  be  the  minimum  demand  you 
could  expect? 


LARGE  DATA  SET  EXERCISES 


23.  Large  Data  Set  1  lists  the  SAT  scores  and  GPAs  of  1,000  students. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data1.xls

a. Give a point estimate of the mean GPA of all students who score 1350 on
the  SAT. 

b.  Construct  a  90%  confidence  interval  for  the  mean  GPA  of  all  students  who 
score  1350  on  the  SAT. 

24.  Large  Data  Set  12  lists  the  golf  scores  on  one  round  of  golf  for  75  golfers  first 
using  their  own  original  clubs,  then  using  clubs  of  a  new,  experimental  design 
(after  two  months  of  familiarization  with  the  new  clubs). 

http://www.gone.2012books.lardbucket.org/sites/all/files/data12.xls

a.  Thurio  averages  72  strokes  per  round  with  his  own  clubs.  Give  a  point 
estimate  for  his  score  on  one  round  if  he  switches  to  the  new  clubs. 

b.  Explain  whether  an  interval  estimate  for  this  problem  is  a  confidence 
interval  or  a  prediction  interval. 

c.  Based  on  your  answer  to  (b),  construct  an  interval  estimate  for  Thurio’s 
score  on  one  round  if  he  switches  to  the  new  clubs,  at  90%  confidence. 

25.  Large  Data  Set  13  records  the  number  of  bidders  and  sales  price  of  a  particular 
type  of  antique  grandfather  clock  at  60  auctions. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data13.xls

a.  There  are  seven  likely  bidders  at  the  Verona  auction  today.  Give  a  point 
estimate  for  the  price  of  such  a  clock  at  today’s  auction. 

b.  Explain  whether  an  interval  estimate  for  this  problem  is  a  confidence 
interval  or  a  prediction  interval. 

c.  Based  on  your  answer  to  (b),  construct  an  interval  estimate  for  the  likely 
sale  price  of  such  a  clock  at  today’s  sale,  at  95%  confidence. 


ANSWERS


1.  a. 5.647
    b. 5.647 ± 1.253

3.  a. −0.188
    b. −0.188 ± 3.041

5.  a. 1.875
    b. 1.875 ± 1.423

7.  a. 5.4
    b. 5.4 ± 3.355
    c. invalid (extrapolation)

9.  a. 2.4
    b. 2.4 ± 1.474
    c. valid (−1 is in the range of the x-values in the data set)

11. a. 31.3 words
    b. 31.3 ± 7.1 words
    c. not valid, since two years is 24 months, hence this is extrapolation

13. a. 73.2 beats/min
    b. The man’s heart rate is not the predicted average for all men his age.
    c. 73.2 ± 1.2 beats/min

15. a. $224,562
    b. $224,562 ± $28,699

17. a. 74
    b. Prediction (one person, not an average for all who have average 78.6
       before the final exam)
    c. 74 ± 24

19. a. 0.066%
    b. 0.066 ± 0.034%

21. a. 4,656 psi
    b. 4,656 ± 321 psi
    c. 4,656 − 321 = 4,335 psi

23. a. 2.19
    b. (2.1421, 2.2316)

25. a. 7771.39
    b. A prediction interval.
    c. (7410.41, 8132.38)


10.8 A Complete Example


LEARNING  OBJECTIVE 

1.  To  see  a  complete  linear  correlation  and  regression  analysis,  in  a 
practical  setting,  as  a  cohesive  whole. 


In  the  preceding  sections  numerous  concepts  were  introduced  and  illustrated,  but 
the  analysis  was  broken  into  disjoint  pieces  by  sections.  In  this  section  we  will  go 
through  a  complete  example  of  the  use  of  correlation  and  regression  analysis  of 
data  from  start  to  finish,  touching  on  all  the  topics  of  this  chapter  in  sequence. 


In general educators are convinced that, all other factors being equal, class
attendance has a significant bearing on course performance. To investigate the
relationship between attendance and performance, an education researcher selects
for study a multiple-section introductory statistics course at a large university.
Instructors in the course agree to keep an accurate record of attendance
throughout one semester. At the end of the semester 26 students are selected at
random. For each student in the sample two measurements are taken: x, the
number of days the student was absent, and y, the student’s score on the common
final exam in the course. The data are summarized in Table 10.4 "Absence and Score
Data".


Table 10.4 Absence and Score Data


Absences   Score   Absences   Score
   x         y        x         y
   2        76        4        41
   7        29        5        63
   2        96        4        88
   7        63        0        98
   2        79        1        99
   7        71        0        89
   0        88        1        96
   0        92        3        90
   6        55        1        90
   6        70        3        68
   2        80        1        84
   2        75        3        80
   1        63        1        78


A  scatter  plot  of  the  data  is  given  in  Figure  10.13  "Plot  of  the  Absence  and  Exam 
Score  Pairs".  There  is  a  downward  trend  in  the  plot  which  indicates  that  on  average 
students  with  more  absences  tend  to  do  worse  on  the  final  examination. 


Figure  10.13  Plot  of  the  Absence  and  Exam  Score  Pairs 


The trend observed in Figure 10.13 "Plot of the Absence and Exam Score Pairs" as
well as the fairly constant width of the apparent band of points in the plot makes it
reasonable to assume a relationship between x and y of the form

\[ y = \beta_1 x + \beta_0 + \varepsilon \]

where \(\beta_1\) and \(\beta_0\) are unknown parameters and \(\varepsilon\) is a normal random variable with
mean zero and unknown standard deviation \(\sigma\). Note carefully that this model is
being proposed for the population of all students taking this course, not just those
taking it this semester, and certainly not just those in the sample. The numbers \(\beta_1\),
\(\beta_0\), and \(\sigma\) are parameters relating to this large population.


First  we  perform  preliminary  computations  that  will  be  needed  later.  The  data  are 
processed  in  Table  10.5  "Processed  Absence  and  Score  Data". 


Table 10.5 Processed Absence and Score Data


 x    y   x^2    xy   y^2   |   x    y   x^2    xy   y^2
 2   76     4   152  5776   |   4   41    16   164  1681
 7   29    49   203   841   |   5   63    25   315  3969
 2   96     4   192  9216   |   4   88    16   352  7744
 7   63    49   441  3969   |   0   98     0     0  9604
 2   79     4   158  6241   |   1   99     1    99  9801
 7   71    49   497  5041   |   0   89     0     0  7921
 0   88     0     0  7744   |   1   96     1    96  9216
 0   92     0     0  8464   |   3   90     9   270  8100
 6   55    36   330  3025   |   1   90     1    90  8100
 6   70    36   420  4900   |   3   68     9   204  4624
 2   80     4   160  6400   |   1   84     1    84  7056
 2   75     4   150  5625   |   3   80     9   240  6400
 1   63     1    63  3969   |   1   78     1    78  6084

Adding up the numbers in each column in Table 10.5 "Processed Absence and Score
Data" gives

\[ \Sigma x = 71, \quad \Sigma y = 2001, \quad \Sigma x^2 = 329, \quad \Sigma xy = 4758, \quad \text{and} \quad \Sigma y^2 = 161511 \]


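Then

\[
\begin{aligned}
SS_{xx} &= \Sigma x^2 - \frac{1}{n}(\Sigma x)^2 = 329 - \frac{1}{26}(71)^2 = 135.1153846 \\
SS_{xy} &= \Sigma xy - \frac{1}{n}(\Sigma x)(\Sigma y) = 4758 - \frac{1}{26}(71)(2001) = -706.2692308 \\
SS_{yy} &= \Sigma y^2 - \frac{1}{n}(\Sigma y)^2 = 161511 - \frac{1}{26}(2001)^2 = 7510.961538
\end{aligned}
\]

and

\[
\bar{x} = \frac{\Sigma x}{n} = \frac{71}{26} = 2.730769231
\quad \text{and} \quad
\bar{y} = \frac{\Sigma y}{n} = \frac{2001}{26} = 76.96153846
\]
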
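
These preliminary computations are easy to reproduce by machine. The sketch below is an illustration rather than part of the original analysis: it lists the 26 pairs from Table 10.4 row by row and forms the column sums and the three sums of squares; the variable names are our own.

```python
# (absences x, final exam score y) for the 26 students, read row by row
# from Table 10.4.
data = [
    (2, 76), (4, 41), (7, 29), (5, 63), (2, 96), (4, 88), (7, 63), (0, 98),
    (2, 79), (1, 99), (7, 71), (0, 89), (0, 88), (1, 96), (0, 92), (3, 90),
    (6, 55), (1, 90), (6, 70), (3, 68), (2, 80), (1, 84), (2, 75), (3, 80),
    (1, 63), (1, 78),
]

n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_x2 = sum(x * x for x, _ in data)
sum_xy = sum(x * y for x, y in data)
sum_y2 = sum(y * y for _, y in data)

# Shortcut formulas for the sums of squares.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_y2 - sum_y ** 2 / n
x_bar, y_bar = sum_x / n, sum_y / n

print(n, sum_x, sum_y, sum_x2, sum_xy, sum_y2)
print(round(ss_xx, 7), round(ss_xy, 7), round(ss_yy, 6))
print(round(x_bar, 9), round(y_bar, 8))
```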


We begin the actual modelling by finding the least squares regression line, the line
that best fits the data. Its slope and y-intercept are

\[
\begin{aligned}
\hat{\beta}_1 &= \frac{SS_{xy}}{SS_{xx}} = \frac{-706.2692308}{135.1153846} = -5.227156278 \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} = 76.96153846 - (-5.227156278)(2.730769231) \approx 91.2357
\end{aligned}
\]

Rounding these numbers to two decimal places, the least squares regression line for
these data is

\[ \hat{y} = -5.23x + 91.24. \]
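
As a check on the slope and intercept, here is a short sketch that starts from the sums of squares and means already computed. The names `beta1_hat`, `beta0_hat`, and `predict` are ours, chosen for illustration only.

```python
# Summary quantities taken from the preliminary computations.
ss_xx = 135.1153846
ss_xy = -706.2692308
x_bar = 71 / 26
y_bar = 2001 / 26

beta1_hat = ss_xy / ss_xx                # slope
beta0_hat = y_bar - beta1_hat * x_bar    # y-intercept


def predict(x):
    """Height of the fitted line at x, using the unrounded coefficients."""
    return beta1_hat * x + beta0_hat


print(round(beta1_hat, 2), round(beta0_hat, 2))  # -5.23 91.24
```

A useful sanity check is that the least squares line always passes through the point of means, so `predict(x_bar)` should agree with `y_bar` up to rounding error.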

The goodness of fit of this line to the scatter plot, the sum of its squared errors, is

\[ SSE = SS_{yy} - \hat{\beta}_1 SS_{xy} = 7510.961538 - (-5.227156278)(-706.2692308) = 3819.181894 \]

This number is not particularly informative in itself, but we use it to compute the
important statistic

\[ s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{3819.181894}{24}} = \sqrt{159.1325789} \approx 12.6148 \]

The statistic \(s_\varepsilon\) estimates the standard deviation \(\sigma\) of the normal random variable \(\varepsilon\)
in the model. Its meaning is that among all students with the same number of
absences, the standard deviation of their scores on the final exam is about 12.6
points. Such a large value on a 100-point exam means that the final exam scores of
each sub-population of students, based on the number of absences, are highly
variable.
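
The sum of squared errors and the sample standard deviation of errors can be recomputed from the quantities already in hand. This is a minimal sketch with variable names of our own choosing; note that the divisor under the square root is the number of degrees of freedom, n − 2 = 24, not the sample size.

```python
import math

n = 26
ss_xy = -706.2692308
ss_yy = 7510.961538
beta1_hat = -5.227156278

sse = ss_yy - beta1_hat * ss_xy      # sum of squared errors
s_eps = math.sqrt(sse / (n - 2))     # divide by degrees of freedom, n - 2

print(round(sse, 3), round(s_eps, 4))
```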


The size and sign of the slope \(\hat{\beta}_1 = -5.23\) indicate that, for every class missed,
students tend to score about 5.23 fewer points on the final exam on average.
Similarly for every two classes missed students tend to score on average
2 × 5.23 = 10.46 fewer points on the final exam, or about a letter grade worse on
average.


Since 0 is in the range of x-values in the data set, the y-intercept also has meaning
in this problem. It is an estimate of the average grade on the final exam of all
students who have perfect attendance. The predicted average of such students is
\(\hat{\beta}_0 = 91.24\).

Before we use the regression equation further, or perform other analyses, it would
be a good idea to examine the utility of the linear regression model. We can do this
in two ways: 1) by computing the correlation coefficient r to see how strongly the
number of absences x and the score y on the final exam are correlated, and 2) by
testing the null hypothesis \(H_0: \beta_1 = 0\) (the slope of the population regression line
is zero, so x is not a good predictor of y) against the natural alternative \(H_a: \beta_1 < 0\)
(the slope of the population regression line is negative, so final exam scores y go
down as absences x go up).


The correlation coefficient r is

\[
r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}
  = \frac{-706.2692308}{\sqrt{(135.1153846)(7510.961538)}}
  = -0.70108409
\]

a moderate negative correlation.
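
The value of r can be confirmed directly from the three sums of squares. The snippet below is a sketch for checking the arithmetic only; squaring r also gives the coefficient of determination for these data.

```python
import math

ss_xx = 135.1153846
ss_xy = -706.2692308
ss_yy = 7510.961538

r = ss_xy / math.sqrt(ss_xx * ss_yy)   # linear correlation coefficient
r_squared = r ** 2                     # coefficient of determination

print(round(r, 4), round(r_squared, 4))
```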


Turning to the test of hypotheses, let us test at the commonly used 5% level of
significance. The test is

\[
\begin{aligned}
H_0 &: \beta_1 = 0 \\
\text{vs. } H_a &: \beta_1 < 0 \quad @\ \alpha = 0.05
\end{aligned}
\]
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From  Figure  12.3  "Critical  Values  of",  with  df  —  26  —  2  =  24  degrees  of  freedom 
4).05  —  1.71 1,  so  the  rejection  region  is  (  — go,  —  1 .7 1 1]  .  The  value  of  the 
standardized  test  statistic  is 

,=  - _ -5-2271;6278  -0 _ =  -5.013 

se  /  y/SS^  12.11988495/  V135.1153846 

which  falls  in  the  rejection  region.  We  reject  Ho  in  favor  of  Ha.  The  data  provide 
sufficient  evidence,  at  the  5%  level  of  significance,  to  conclude  that  If  is  negative, 
meaning  that  as  the  number  of  absences  increases  average  score  on  the  final  exam 
decreases. 
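As a rough check of this test, the following Python sketch recomputes the standardized test statistic and compares it with the critical value. It is illustrative only: the variable names are assumptions made here, and the critical value 1.711 is copied from the table rather than computed.

```python
import math

beta1_hat = -5.227156278  # least squares slope (unrounded value from the text)
s_eps = 12.11988495       # sample standard deviation of errors
ss_xx = 135.1153846
t_crit = 1.711            # t_0.05 with df = 24, read from the table of critical values

# Test statistic for H0: beta_1 = 0 against Ha: beta_1 < 0.
t_stat = (beta1_hat - 0) / (s_eps / math.sqrt(ss_xx))

# Left-tailed test: reject H0 when T falls at or below -t_crit.
reject = t_stat <= -t_crit

print(round(t_stat, 3), reject)  # -5.013 True
```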


As already noted, the value $\hat{\beta}_1 = -5.23$ gives a point estimate of how much one
additional absence is reflected in the average score on the final exam. For each
additional absence the average drops by about 5.23 points. We can widen this point
estimate to a confidence interval for $\beta_1$. At the 95% confidence level, from Figure
12.3 "Critical Values of $t$" with $df = 26 - 2 = 24$ degrees of freedom,
$t_{\alpha/2} = t_{0.025} = 2.064$. The 95% confidence interval for $\beta_1$ based on our sample data
is

$$\hat{\beta}_1 \pm t_{\alpha/2}\,\frac{s_\varepsilon}{\sqrt{SS_{xx}}} = -5.23 \pm 2.064\,\frac{12.11988495}{\sqrt{135.1153846}} = -5.23 \pm 2.15$$

or $(-7.38, -3.08)$. We are 95% confident that, among all students who ever take
this course, for each additional class missed the average score on the final exam
goes down by between 3.08 and 7.38 points.
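The interval for the slope can be reproduced with a few lines of Python. This is a sketch under the assumption that the quoted values are used as given; the names are illustrative, and 2.064 is the tabulated critical value, not a computed one.

```python
import math

beta1_hat = -5.227156278  # unrounded slope estimate from the text
s_eps = 12.11988495       # sample standard deviation of errors
ss_xx = 135.1153846
t_half_alpha = 2.064      # t_0.025 with df = 24, read from the table

# Margin of error: t_{alpha/2} * s_eps / sqrt(SS_xx).
margin = t_half_alpha * s_eps / math.sqrt(ss_xx)
lower, upper = beta1_hat - margin, beta1_hat + margin

print(round(margin, 2), round(lower, 2), round(upper, 2))  # 2.15 -7.38 -3.08
```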


If we restrict attention to the sub-population of all students who have exactly five
absences, say, then using the least squares regression equation
$\hat{y} = -5.23x + 91.24$ we estimate that the average score on the final exam for
those students is

$$\hat{y} = -5.23(5) + 91.24 = 65.09$$

This is also our best guess as to the score on the final exam of any particular student
who is absent five times. A 95% confidence interval for the average score on the
final exam for all students with five absences is

$$\begin{aligned}
\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
&= 65.09 \pm (2.064)(12.11988495)\sqrt{\frac{1}{26} + \frac{(5 - \bar{x})^2}{135.1153846}} \\
&= 65.09 \pm 25.01544254\sqrt{0.0765727299} \\
&= 65.09 \pm 6.92
\end{aligned}$$

which is the interval $(58.17, 72.01)$. This confidence interval suggests that the
true mean score on the final exam for all students who are absent from class exactly
five times during the semester is likely to be between 58.17 and 72.01.


If a particular student misses exactly five classes during the semester, his score on
the final exam is predicted with 95% confidence to be in the interval

$$\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 65.09 \pm 25.01544254\sqrt{1.0765727299} = 65.09 \pm 25.96$$

which is the interval $(39.13, 91.05)$. This prediction interval suggests that this
individual student’s final exam score is likely to be between 39.13 and 91.05.

Whereas the 95% confidence interval for the average score of all students with five
absences gave real information, this interval is so wide that it says practically
nothing about what the individual student’s final exam score might be. This is an
example of the dramatic effect that the presence of the extra summand 1 under the
square root sign in the prediction interval can have.
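To see the effect of the extra summand 1 numerically, the Python sketch below computes both intervals at five absences side by side. It is an illustration, not part of the original example: the variable names are chosen here, and the radicand 0.0765727299 is taken directly from the text as the value of $1/n + (x_p - \bar{x})^2 / SS_{xx}$.

```python
import math

y_hat = -5.23 * 5 + 91.24  # predicted score at x_p = 5 absences
t_half_alpha = 2.064       # t_0.025 with df = 24, read from the table
s_eps = 12.11988495        # sample standard deviation of errors
radicand = 0.0765727299    # 1/n + (x_p - x_bar)^2 / SS_xx, as quoted in the text

# Confidence interval for the mean score of all students with five absences.
half_ci = t_half_alpha * s_eps * math.sqrt(radicand)
ci = (y_hat - half_ci, y_hat + half_ci)

# Prediction interval for one student: same formula with 1 added under the root.
half_pi = t_half_alpha * s_eps * math.sqrt(1 + radicand)
pi = (y_hat - half_pi, y_hat + half_pi)

print(round(half_ci, 2), round(half_pi, 2))  # 6.92 25.96
```

The two half-widths differ by a factor of almost four even though the formulas differ only by the summand 1.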


Finally, the proportion of the variability in the scores of students on the final exam
that is explained by the linear relationship between that score and the number of
absences is estimated by the coefficient of determination, $r^2$. Since we have already
computed $r$ above we easily find that

$$r^2 = (-0.7010840977)^2 = 0.491518912$$

or about 49%. Thus although there is a significant correlation between attendance
and performance on the final exam, and we can estimate with fair accuracy the
average score of students who miss a certain number of classes, nevertheless less
than half the total variation of the exam scores in the sample is explained by the
number of absences. This should not come as a surprise, since there are many
factors besides attendance that bear on student performance on exams.

KEY TAKEAWAY


•  It  is  a  good  idea  to  attend  class. 


EXERCISES 


The  exercises  in  this  section  are  unrelated  to  those  in  previous  sections. 

1. The data give the amount $x$ of silicofluoride in the water (mg/L) and the
amount $y$ of lead in the bloodstream (μg/dL) of ten children in various
communities with and without municipal water. Perform a complete analysis
of the data, in analogy with the discussion in this section (that is, make a
scatter plot, do preliminary computations, find the least squares regression
line, find $SSE$, $s_\varepsilon$, and $r$, and so on). In the hypothesis test use as the
alternative hypothesis $\beta_1 > 0$, and test at the 5% level of significance. Use
confidence level 95% for the confidence interval for $\beta_1$. Construct 95%
confidence and prediction intervals at $x_p = 2$ at the end.

x    0.0    0.0    1.1    1.4    1.6
y    0.3    0.1    4.7    3.2    5.1

x    1.7    2.0    2.0    2.2    2.2
y    7.0    5.0    6.1    8.6    9.5

2. The table gives the weight $x$ (thousands of pounds) and available heat energy $y$
(million BTU) of a standard cord of various species of wood typically used for
heating. Perform a complete analysis of the data, in analogy with the
discussion in this section (that is, make a scatter plot, do preliminary
computations, find the least squares regression line, find $SSE$, $s_\varepsilon$, and $r$, and
so on). In the hypothesis test use as the alternative hypothesis $\beta_1 > 0$, and
test at the 5% level of significance. Use confidence level 95% for the confidence
interval for $\beta_1$. Construct 95% confidence and prediction intervals at
$x_p = 5$ at the end.

x    3.37    3.50    4.29    4.00    4.64
y    23.6    17.5    20.1    21.6    28.1

x    4.99    4.94    5.48    3.26    4.16
y    25.3    27.0    30.7    18.9    20.7

LARGE  DATA  SET  EXERCISES 


3. Large Data Sets 3 and 3A list the shoe sizes and heights of 174 customers
entering a shoe store. The gender of the customer is not indicated in Large
Data Set 3. However, men’s and women’s shoes are not measured on the same
scale; for example, a size 8 shoe for men is not the same size as a size 8 shoe for
women. Thus it would not be meaningful to apply regression analysis to Large
Data Set 3. Nevertheless, construct the scatter diagrams, with shoe size as the
independent variable ($x$) and height as the dependent variable ($y$), for (i) just
the data on men, (ii) just the data on women, and (iii) the full mixed data set
with both men and women. Does the third, invalid scatter diagram look
markedly different from the other two?

http://www.gone.2012books.lardbucket.org/sites/all/files/data3.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data3A.xls

4. Separate out from Large Data Set 3A just the data on men and do a complete
analysis, with shoe size as the independent variable ($x$) and height as the
dependent variable ($y$). Use $\alpha = 0.05$ and $x_p = 10$ whenever appropriate.

http://www.gone.2012books.lardbucket.org/sites/all/files/data3A.xls

5. Separate out from Large Data Set 3A just the data on women and do a complete
analysis, with shoe size as the independent variable ($x$) and height as the
dependent variable ($y$). Use $\alpha = 0.05$ and $x_p = 10$ whenever appropriate.

http://www.gone.2012books.lardbucket.org/sites/all/files/data3A.xls

ANSWERS


1. $\sum x = 14.2$, $\sum y = 49.6$, $\sum xy = 91.73$, $\sum x^2 = 26.3$,
$\sum y^2 = 333.86$.

$SS_{xx} = 6.136$, $SS_{xy} = 21.298$, $SS_{yy} = 87.844$.

$\bar{x} = 1.42$, $\bar{y} = 4.96$.

$\hat{\beta}_1 = 3.47$, $\hat{\beta}_0 = 0.03$.

$SSE = 13.92$.
$s_\varepsilon = 1.32$.

$r = 0.9174$, $r^2 = 0.8416$.
$df = 8$, $T = 6.518$.

The 95% confidence interval for $\beta_1$ is: $(2.24, 4.70)$.

At $x_p = 2$, the 95% confidence interval for $E(y)$ is $(5.77, 8.17)$.

At $x_p = 2$, the 95% prediction interval for $y$ is $(3.73, 10.21)$.

3. The positively correlated trend seems less profound than that in each of the
previous plots.

5. The regression line: $\hat{y} = 3.3426x + 138.7692$. Coefficient of
Correlation: $r = 0.9431$. Coefficient of Determination: $r^2 = 0.8894$.
$SSE = 283.2473$. $s_\varepsilon = 1.9305$. A 95% confidence interval for $\beta_1$:
$(3.0733, 3.6120)$. Test Statistic for $H_0: \beta_1 = 0$: $T = 24.7209$. At
$x_p = 10$, $\hat{y} = 172.1956$; a 95% confidence interval for the mean value of
$y$ is: $(171.5577, 172.8335)$; and a 95% prediction interval for an
individual value of $y$ is: $(168.2974, 176.0938)$.

10.9 Formula List


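$$SS_{xx} = \sum x^2 - \frac{1}{n}\Big(\sum x\Big)^2 \qquad SS_{xy} = \sum xy - \frac{1}{n}\Big(\sum x\Big)\Big(\sum y\Big) \qquad SS_{yy} = \sum y^2 - \frac{1}{n}\Big(\sum y\Big)^2$$

Correlation coefficient:

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}}$$

Least squares regression equation (equation of the least squares regression line):

$$\hat{y} = \hat{\beta}_1 x + \hat{\beta}_0 \quad \text{where} \quad \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Sum of the squared errors for the least squares regression line:

$$SSE = SS_{yy} - \hat{\beta}_1 SS_{xy}$$

Sample standard deviation of errors:

$$s_\varepsilon = \sqrt{\frac{SSE}{n - 2}}$$

$100(1 - \alpha)\%$ confidence interval for $\beta_1$:

$$\hat{\beta}_1 \pm t_{\alpha/2}\,\frac{s_\varepsilon}{\sqrt{SS_{xx}}} \qquad (df = n - 2)$$

Standardized test statistic for hypothesis tests concerning $\beta_1$:

$$T = \frac{\hat{\beta}_1 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}} \qquad (df = n - 2)$$

Coefficient of determination:

$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{SS_{xy}^2}{SS_{xx}\,SS_{yy}} = \hat{\beta}_1\,\frac{SS_{xy}}{SS_{yy}}$$

$100(1 - \alpha)\%$ confidence interval for the mean value of $y$ at $x = x_p$:

$$\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \qquad (df = n - 2)$$

$100(1 - \alpha)\%$ prediction interval for an individual new value of $y$ at $x = x_p$:

$$\hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \qquad (df = n - 2)$$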
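The formulas in this list chain together naturally, so a short Python sketch may help show the order in which they are used. The code is not part of the original text: the function name `regression_summary` and its dictionary keys are hypothetical, and the data are the ten silicofluoride and blood lead pairs from Exercise 1 of Section 10.8, whose printed answers give something to compare against.

```python
import math

def regression_summary(x, y):
    """Apply the formula list to paired data; returns a dict of statistics."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(v * v for v in x) - sum_x ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_yy = sum(v * v for v in y) - sum_y ** 2 / n
    beta1 = ss_xy / ss_xx                      # slope
    beta0 = sum_y / n - beta1 * sum_x / n      # intercept
    sse = ss_yy - beta1 * ss_xy                # sum of squared errors
    s_eps = math.sqrt(sse / (n - 2))           # sample std. deviation of errors
    r = ss_xy / math.sqrt(ss_xx * ss_yy)       # correlation coefficient
    t_stat = beta1 / (s_eps / math.sqrt(ss_xx))  # test statistic for H0: beta_1 = 0
    return {"beta1": beta1, "beta0": beta0, "sse": sse,
            "s_eps": s_eps, "r": r, "t": t_stat, "df": n - 2}

# Data of Exercise 1 in Section 10.8.
x = [0.0, 0.0, 1.1, 1.4, 1.6, 1.7, 2.0, 2.0, 2.2, 2.2]
y = [0.3, 0.1, 4.7, 3.2, 5.1, 7.0, 5.0, 6.1, 8.6, 9.5]

result = regression_summary(x, y)
for name, value in result.items():
    print(name, round(value, 4))
```

Rounded to the precision used in the answer key, the output should agree with the values listed there for Exercise 1 (for example a slope of about 3.47 and a test statistic of about 6.518).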
Chi-Square Tests and F-Tests


In  previous  chapters  you  saw  how  to  test  hypotheses  concerning  population  means 
and  population  proportions.  The  idea  of  testing  hypotheses  can  be  extended  to 
many  other  situations  that  involve  different  parameters  and  use  different  test 
statistics.  Whereas  the  standardized  test  statistics  that  appeared  in  earlier  chapters 
followed  either  a  normal  or  Student  t-distribution,  in  this  chapter  the  tests  will 
involve  two  other  very  common  and  useful  distributions,  the  chi-square  and  the  F- 
distributions.  The  chi-square  distribution1  arises  in  tests  of  hypotheses 
concerning  the  independence  of  two  random  variables  and  concerning  whether  a 
discrete  random  variable  follows  a  specified  distribution.  The  F-distribution2  arises 
in  tests  of  hypotheses  concerning  whether  or  not  two  population  variances  are 
equal  and  concerning  whether  or  not  three  or  more  population  means  are  equal. 


1. A particular probability distribution specified by a number of degrees of freedom,
$df$.

2. A particular probability distribution specified by two degrees of freedom, $df_1$ and
$df_2$.

11.1 Chi-Square Tests for Independence


LEARNING  OBJECTIVES 

1.  To  understand  what  chi-square  distributions  are. 

2.  To  understand  how  to  use  a  chi-square  test  to  judge  whether  two  factors 
are  independent. 


Chi-Square  Distributions 

As you know, there is a whole family of t-distributions, each one specified by a
parameter called the degrees of freedom, denoted $df$. Similarly, all the chi-square
distributions form a family, and each of its members is also specified by a parameter
$df$, the number of degrees of freedom. Chi is a Greek letter denoted by the symbol $\chi$
and chi-square is often denoted by $\chi^2$. Figure 11.1 "Many $\chi^2$ Distributions" shows
several chi-square distributions for different degrees of freedom. A chi-square random
variable3 is a random variable that assumes only positive values and follows a
chi-square distribution.

3. A random variable that follows a chi-square distribution.

Figure 11.1 Many $\chi^2$ Distributions

Figure 12.4 "Critical Values of Chi-Square Distributions" gives values of $\chi^2_c$ for
various values of $c$ under several chi-square distributions with various degrees
of freedom.

Tests  for  Independence 

Hypothesis tests encountered earlier in the book had to do with how the numerical
values of two population parameters compared. In this subsection we will
investigate hypotheses that have to do with whether or not two random variables
take their values independently, or whether the value of one has a relation to the
value of the other. Thus the hypotheses will be expressed in words, not
mathematical symbols. We build the discussion around the following example.

There is a theory that the gender of a baby in the womb is related to the baby’s
heart rate: baby girls tend to have higher heart rates. Suppose we wish to test this
theory. We examine the heart rate records of 40 babies taken during their mothers’
last prenatal checkups before delivery, and for each of these 40 randomly selected
records we record the values of two random measures: 1) gender and 2) heart
rate. In this context these two random measures are often called factors4. Since the
burden of proof is that heart rate and gender are related, not that they are
unrelated, the problem of testing the theory on baby gender and heart rate can be
formulated as a test of the following hypotheses:

$H_0$: Baby gender and baby heart rate are independent
vs. $H_a$: Baby gender and baby heart rate are not independent

4. A variable with several qualitative levels.

The  factor  gender  has  two  natural  categories  or  levels:  boy  and  girl.  We  divide  the 
second  factor,  heart  rate,  into  two  levels,  low  and  high,  by  choosing  some  heart 
rate,  say  145  beats  per  minute,  as  the  cutoff  between  them.  A  heart  rate  below  145 
beats  per  minute  will  be  considered  low  and  145  and  above  considered  high.  The  40 
records  give  rise  to  a  2  x  2  contingency  table.  By  adjoining  row  totals,  column  totals, 
and  a  grand  total  we  obtain  the  table  shown  as  Table  11.1  "Baby  Gender  and  Heart 
Rate".  The  four  entries  in  boldface  type  are  counts  of  observations  from  the  sample 
of  n  =  40.  There  were  11  girls  with  low  heart  rate,  17  boys  with  low  heart  rate,  and  so 
on.  They  form  the  core  of  the  expanded  table. 


Table 11.1 Baby Gender and Heart Rate

                         Heart Rate
                     Low      High     Row Total
Gender    Girl        11         7        18
          Boy         17         5        22
Column Total          28        12        Total = 40

In analogy with the fact that the probability of independent events is the product of
the probabilities of each event, if heart rate and gender were independent then we
would expect the number in each core cell to be close to the product of the row
total $R$ and column total $C$ of the row and column containing it, divided by the
sample size $n$. Denoting such an expected number of observations $E$, these four
expected values are:

• 1st row and 1st column: $E = (R \times C)/n = 18 \times 28 / 40 = 12.6$

• 1st row and 2nd column: $E = (R \times C)/n = 18 \times 12 / 40 = 5.4$

• 2nd row and 1st column: $E = (R \times C)/n = 22 \times 28 / 40 = 15.4$

• 2nd row and 2nd column: $E = (R \times C)/n = 22 \times 12 / 40 = 6.6$
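The four expected counts above all come from the same rule, which the following Python sketch applies to every core cell at once. It is an illustration with variable names chosen here; the observed counts are the four core entries of Table 11.1.

```python
# Observed counts from Table 11.1: rows are Girl, Boy; columns are Low, High.
observed = [[11, 7],
            [17, 5]]

row_totals = [sum(row) for row in observed]        # R for each row
col_totals = [sum(col) for col in zip(*observed)]  # C for each column
n = sum(row_totals)                                # grand total

# Expected count under independence: E = (R x C) / n for each core cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

print(row_totals, col_totals, n)  # [18, 22] [28, 12] 40
print(expected)
```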

We update Table 11.1 "Baby Gender and Heart Rate" by placing each expected value
in its corresponding core cell, right under the observed value in the cell. This gives
the updated table Table 11.2 "Updated Baby Gender and Heart Rate".

Table 11.2 Updated Baby Gender and Heart Rate

                         Heart Rate
                     Low          High         Row Total
Gender    Girl       O = 11       O = 7        R = 18
                     E = 12.6     E = 5.4
          Boy        O = 17       O = 5        R = 22
                     E = 15.4     E = 6.6
Column Total         C = 28       C = 12       n = 40

A measure of how much the data deviate from what we would expect to see if the
factors really were independent is the sum of the squares of the differences $O - E$
between the observed and expected numbers in each core cell, or, standardizing by
dividing each square by the expected number in the cell, the sum $\sum (O - E)^2 / E$.
We would reject the null hypothesis that the factors are independent only if this
number is large, so the test is right-tailed. In this example the random variable
$\sum (O - E)^2 / E$ has the chi-square distribution with one degree of freedom. If we
had decided at the outset to test at the 10% level of significance, the critical value
defining the rejection region would be, reading from Figure 12.4 "Critical Values of
Chi-Square Distributions", $\chi^2_\alpha = \chi^2_{0.10} = 2.706$, so that the rejection
region would be the interval $[2.706, \infty)$.
When we compute the value of the standardized test statistic we obtain

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(11 - 12.6)^2}{12.6} + \frac{(7 - 5.4)^2}{5.4} + \frac{(17 - 15.4)^2}{15.4} + \frac{(5 - 6.6)^2}{6.6} = 1.231$$

Since 1.231 < 2.706, the decision is not to reject $H_0$. See Figure 11.3 "Baby Gender
Prediction". The data do not provide sufficient evidence, at the 10% level of
significance, to conclude that heart rate and gender are related.
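The value 1.231 and the resulting decision can be verified with a brief Python sketch. It is illustrative only: the variable names are assumptions, the observed and expected counts are those of the baby gender example, and the critical value 2.706 is copied from the table of critical values rather than computed.

```python
# Observed and expected counts for the four core cells, listed in the same order.
observed = [11, 7, 17, 5]
expected = [12.6, 5.4, 15.4, 6.6]

# Chi-square test statistic: sum of (O - E)^2 / E over all core cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 2.706           # chi-square critical value at the 10% level, df = 1
reject = chi2 >= critical_value  # right-tailed test

print(round(chi2, 3), reject)  # 1.231 False
```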
Figure 11.3 Baby Gender Prediction


With this specific example in mind, now turn to the general situation. In the general
setting of testing the independence of two factors, call them Factor 1 and Factor 2,
the hypotheses to be tested are

$H_0$: The two factors are independent
vs. $H_a$: The two factors are not independent

As in the example each factor is divided into a number of categories or levels. These
could arise naturally, as in the boy-girl division of gender, or somewhat arbitrarily,
as in the high-low division of heart rate. Suppose Factor 1 has $I$ levels and Factor 2
has $J$ levels. Then the information from a random sample gives rise to a general $I \times J$
contingency table, which with row totals, column totals, and a grand total would
appear as shown in Table 11.3 "General Contingency Table". Each cell may be
labeled by a pair of indices $(i, j)$. $O_{ij}$ stands for the observed count of observations
in the cell in row $i$ and column $j$, $R_i$ for the $i$th row total and $C_j$ for the $j$th column
total. To simplify the notation we will drop the indices so Table 11.3 "General
Contingency Table" becomes Table 11.4 "Simplified General Contingency Table".
Nevertheless it is important to keep in mind that the $O$s, the $R$s and the $C$s, though
denoted by the same symbols, are in fact different numbers.

Table 11.3 General Contingency Table

                                  Factor 2 Levels
                      1          ...    j          ...    J          Row Total
Factor 1      1       $O_{11}$   ...    $O_{1j}$   ...    $O_{1J}$   $R_1$
Levels        ...
              i       $O_{i1}$   ...    $O_{ij}$   ...    $O_{iJ}$   $R_i$
              ...
              I       $O_{I1}$   ...    $O_{Ij}$   ...    $O_{IJ}$   $R_I$
Column Total          $C_1$      ...    $C_j$      ...    $C_J$      $n$

Table 11.4 Simplified General Contingency Table

                                  Factor 2 Levels
                      1      ...    j      ...    J      Row Total
Factor 1      1       O      ...    O      ...    O      R
Levels        ...
              i       O      ...    O      ...    O      R
              ...
              I       O      ...    O      ...    O      R
Column Total          C      ...    C      ...    C      n

As in the example, for each core cell in the table we compute what would be the
expected number $E$ of observations if the two factors were independent. $E$ is
computed for each core cell (each cell with an $O$ in it) of Table 11.4 "Simplified
General Contingency Table" by the rule applied in the example:

$$E = \frac{R \times C}{n}$$

where $R$ is the row total and $C$ is the column total corresponding to the cell, and
$n$ is the sample size.

After the expected number is computed for every cell, Table 11.4 "Simplified
General Contingency Table" is updated to form Table 11.5 "Updated General
Contingency Table" by inserting the computed value of $E$ into each core cell.


Table 11.5 Updated General Contingency Table

                                  Factor 2 Levels
                      1      ...    j      ...    J      Row Total
Factor 1      1       O      ...    O      ...    O      R
Levels                E             E             E
              ...
              i       O      ...    O      ...    O      R
                      E             E             E
              ...
              I       O      ...    O      ...    O      R
                      E             E             E
Column Total          C      ...    C      ...    C      n

Here is the test statistic for the general hypothesis based on Table 11.5 "Updated
General Contingency Table", together with the conditions under which it follows a
chi-square distribution.

Test Statistic for Testing the Independence of Two Factors

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where the sum is over all core cells of the table.

If

1. the two study factors are independent, and

2. the observed count $O$ of each cell in Table 11.5 "Updated General
Contingency Table" is at least 5,

then $\chi^2$ approximately follows a chi-square distribution with
$df = (I - 1) \times (J - 1)$ degrees of freedom.

The same five-step procedures, either the critical value approach or the p-value
approach, that were introduced in Section 8.1 "The Elements of Hypothesis Testing"
and Section 8.3 "The Observed Significance of a Test" of Chapter 8 "Testing
Hypotheses" are used to perform the test, which is always right-tailed.

EXAMPLE 1


A  researcher  wishes  to  investigate  whether  students’  scores  on  a  college 
entrance  examination  (CEE)  have  any  indicative  power  for  future  college 
performance  as  measured  by  GPA.  In  other  words,  he  wishes  to  investigate 
whether  the  factors  CEE  and  GPA  are  independent  or  not.  He  randomly 
selects  n  =  100  students  in  a  college  and  notes  each  student’s  score  on  the 
entrance  examination  and  his  grade  point  average  at  the  end  of  the 
sophomore  year.  He  divides  entrance  exam  scores  into  two  levels  and  grade 
point  averages  into  three  levels.  Sorting  the  data  according  to  these 
divisions,  he  forms  the  contingency  table  shown  as  Table  11.6  "CEE  versus 
GPA  Contingency  Table",  in  which  the  row  and  column  totals  have  already 
been  computed. 


TABLE 11.6 CEE VERSUS GPA CONTINGENCY TABLE

                                GPA
                    <2.7     2.7 to 3.2     >3.2     Row Total
CEE     < 1800       35          12           5         52
        > 1800        6          24          18         48
Column Total         41          36          23         Total = 100

Test,  at  the  1%  level  of  significance,  whether  these  data  provide  sufficient 
evidence  to  conclude  that  CEE  scores  indicate  future  performance  levels  of 
incoming  college  freshmen  as  measured  by  GPA. 

Solution: 

We  perform  the  test  using  the  critical  value  approach,  following  the  usual 
five-step  method  outlined  at  the  end  of  Section  8.1  "The  Elements  of 
Hypothesis  Testing"  in  Chapter  8  "Testing  Hypotheses". 

• Step 1. The hypotheses are

$H_0$: CEE and GPA are independent factors
vs. $H_a$: CEE and GPA are not independent factors

• Step 2. The distribution is chi-square.

• Step 3. To compute the value of the test statistic we must first
compute the expected number for each of the six core cells (the
ones whose entries are boldface):

° 1st row and 1st column:
$E = (R \times C)/n = 52 \times 41 / 100 = 21.32$

° 1st row and 2nd column:
$E = (R \times C)/n = 52 \times 36 / 100 = 18.72$

° 1st row and 3rd column:
$E = (R \times C)/n = 52 \times 23 / 100 = 11.96$

° 2nd row and 1st column:
$E = (R \times C)/n = 48 \times 41 / 100 = 19.68$

° 2nd row and 2nd column:
$E = (R \times C)/n = 48 \times 36 / 100 = 17.28$

° 2nd row and 3rd column:
$E = (R \times C)/n = 48 \times 23 / 100 = 11.04$

Table 11.6 "CEE versus GPA Contingency Table" is updated to
Table 11.7 "Updated CEE versus GPA Contingency Table".


TABLE 11.7 UPDATED CEE VERSUS GPA CONTINGENCY TABLE

                                   GPA
                    <2.7          2.7 to 3.2     >3.2          Row Total
CEE     < 1800      O = 35        O = 12         O = 5         R = 52
                    E = 21.32     E = 18.72      E = 11.96
        > 1800      O = 6         O = 24         O = 18        R = 48
                    E = 19.68     E = 17.28      E = 11.04
Column Total        C = 41        C = 36         C = 23        n = 100

The test statistic is

$$\begin{aligned}
\chi^2 &= \sum \frac{(O - E)^2}{E} \\
&= \frac{(35 - 21.32)^2}{21.32} + \frac{(12 - 18.72)^2}{18.72} + \frac{(5 - 11.96)^2}{11.96} \\
&\quad + \frac{(6 - 19.68)^2}{19.68} + \frac{(24 - 17.28)^2}{17.28} + \frac{(18 - 11.04)^2}{11.04} \\
&= 31.75
\end{aligned}$$

• Step 4. Since the CEE factor has two levels and the GPA factor has
three, $I = 2$ and $J = 3$. Thus the test statistic follows the chi-square
distribution with $df = (2 - 1) \times (3 - 1) = 2$ degrees of
freedom.

Since the test is right-tailed, the critical value is $\chi^2_{0.01}$. Reading
from Figure 12.4 "Critical Values of Chi-Square Distributions",
$\chi^2_{0.01} = 9.210$, so the rejection region is $[9.210, \infty)$.


•  Step  5.  Since  31.75  >  9.21  the  decision  is  to  reject  the  null  hypothesis.  See 
Figure  11.4.  The  data  provide  sufficient  evidence,  at  the  1%  level  of 
significance,  to  conclude  that  CEE  score  and  GPA  are  not  independent: 
the  entrance  exam  score  has  predictive  power. 
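The computations in Steps 3 through 5 follow the same pattern for any $I \times J$ table, which the Python sketch below tries to capture. It is offered as an illustration, not as part of the original solution: the function name `chi_square_independence` is hypothetical, and the critical value 9.210 is taken from the table of critical values rather than computed.

```python
def chi_square_independence(observed):
    """Return the chi-square statistic and degrees of freedom for an I x J table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count E = (R x C) / n
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Core cells of the CEE versus GPA table: rows are CEE levels, columns are GPA levels.
cee_gpa = [[35, 12, 5],
           [6, 24, 18]]

chi2, df = chi_square_independence(cee_gpa)
critical_value = 9.210  # chi-square critical value at the 1% level, df = 2

print(round(chi2, 2), df, chi2 >= critical_value)  # 31.75 2 True
```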

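Figure 11.4 Note 11.9 "Example 1"

[Figure: chi-square density curve with the critical value $\chi^2_{0.01} = 9.210$ marking the start of the rejection region ("Reject $H_0$") and the observed test statistic $\chi^2 = 31.75$ lying inside it.]

KEY TAKEAWAYS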


•  Critical  values  of  a  chi-square  distribution  with  degrees  of  freedom  df 
are  found  in  Figure  12.4  "Critical  Values  of  Chi-Square  Distributions". 

•  A  chi-square  test5  can  be  used  to  evaluate  the  hypothesis  that  two 
random  variables  or  factors  are  independent. 


5. A test based on a chi-square statistic to check whether two factors are
independent.

1. Find^Q for each of the following number of degrees of freedom.

a. df = 5

b.  df  =  11 

c.  df  =  25 

2. Find $\chi^2_{0.05}$ for each of the following number of degrees of freedom.

a.  df  =  6 

b.  df  =  12 

c.  df  =  30 

3. Find $\chi^2_{0.10}$ for each of the following number of degrees of freedom.

a.  df  =  6 

b.  df  =  12 

c.  df  =  30 

4.  Find^Q  for  each  of  the  following  number  of  degrees  of  freedom. 

a.  df  =  1 

b.  df  =  10 
c. df = 20

5. For df = 7 and $\alpha = 0.05$, find

a. $\chi^2_{\alpha}$

b. $\chi^2_{\alpha/2}$

6. For df = 17 and $\alpha = 0.01$, find

a. $\chi^2_{\alpha}$

b. $\chi^2_{\alpha/2}$

7. A data sample is sorted into a 2 × 2 contingency table based on two factors,
each of which has two levels.

                           Factor 1
                     Level 1    Level 2    Row Total
Factor 2   Level 1      20         10         R
           Level 2      15          5         R
Column Total             C          C         n

a. Find the column totals, the row totals, and the grand total, $n$, of the table.

b. Find the expected number $E$ of observations for each cell based on the
assumption that the two factors are independent (that is, just use the
formula $E = (R \times C)/n$).

c. Find the value of the chi-square test statistic $\chi^2$.

d. Find the number of degrees of freedom of the chi-square test statistic.

8. A data sample is sorted into a 3 × 2 contingency table based on two factors, one
of which has three levels and the other of which has two levels.

                           Factor 1
                     Level 1    Level 2    Row Total
Factor 2   Level 1      20         10         R
           Level 2      15          5         R
           Level 3      10         20         R
Column Total             C          C         n

a. Find the column totals, the row totals, and the grand total, $n$, of the table.

b. Find the expected number $E$ of observations for each cell based on the
assumption that the two factors are independent (that is, just use the
formula $E = (R \times C)/n$).

c. Find the value of the chi-square test statistic $\chi^2$.

d. Find the number of degrees of freedom of the chi-square test statistic.


APPLICATIONS 


9. A child psychologist believes that children perform better on tests when they
are given perceived freedom of choice. To test this belief, the psychologist
carried out an experiment in which 200 third graders were randomly assigned
to two groups, A and B. Each child was given the same simple logic test.
However in group B, each child was given the freedom to choose a test booklet
from many with various drawings on the covers. The performance of each
child was rated as Very Good, Good, or Fair. The results are summarized in
the table provided. Test, at the 5% level of significance, whether there is
sufficient evidence in the data to support the psychologist’s belief.

                               Group
                             A       B
Performance   Very Good     32      29
              Good          55      61
              Fair          10      13

10.  In  regard  to  wine  tasting  competitions,  many  experts  claim  that  the  first  glass 
of  wine  served  sets  a  reference  taste  and  that  a  different  reference  wine  may 
alter  the  relative  ranking  of  the  other  wines  in  competition.  To  test  this  claim, 
three  wines,  A,  B  and  C,  were  served  at  a  wine  tasting  event.  Each  person  was 
served  a  single  glass  of  each  wine,  but  in  different  orders  for  different  guests. 
At  the  close,  each  person  was  asked  to  name  the  best  of  the  three.  One  hundred 
seventy-two  people  were  at  the  event  and  their  top  picks  are  given  in  the  table 
provided.  Test,  at  the  1%  level  of  significance,  whether  there  is  sufficient 
evidence  in  the  data  to  support  the  claim  that  wine  experts’  preference  is 
dependent  on  the  first  served  wine. 


                      Top Pick
First Glass        A       B       C
A                 12      31      27
B                 15      40      21
C                 10       9       7

11.  Is  being  left-handed  hereditary?  To  answer  this  question,  250  adults  are 
randomly  selected  and  their  handedness  and  their  parents’  handedness  are 
noted.  The  results  are  summarized  in  the  table  provided.  Test,  at  the  1%  level 
of  significance,  whether  there  is  sufficient  evidence  in  the  data  to  conclude 
that  there  is  a  hereditary  element  in  handedness. 


                Number of Parents Left-Handed
Handedness         0       1       2
Left               8      10      12
Right            178      21      21

12. Some geneticists claim that the genes that determine left-handedness also
govern development of the language centers of the brain. If this claim is true,
then it would be reasonable to expect that left-handed people tend to have
stronger language abilities. A study designed to test this claim randomly
selected  807  students  who  took  the  Graduate  Record  Examination  (GRE).  Their 
scores  on  the  language  portion  of  the  examination  were  classified  into  three 
categories:  low,  average,  and  high,  and  their  handedness  was  also  noted.  The 
results  are  given  in  the  table  provided.  Test,  at  the  5%  level  of  significance, 
whether  there  is  sufficient  evidence  in  the  data  to  conclude  that  left-handed 
people  tend  to  have  stronger  language  abilities. 


                    GRE English Scores
Handedness       Low    Average    High
Left              18         40      22
Right            201        360     166

13.  It  is  generally  believed  that  children  brought  up  in  stable  families  tend  to  do 
well  in  school.  To  verify  such  a  belief,  a  social  scientist  examined  290  randomly 
selected  students’  records  in  a  public  high  school  and  noted  each  student’s 
family  structure  and  academic  status  four  years  after  entering  high  school.  The 
data were then sorted into a 2 × 3 contingency table with two factors. Factor 1
has  two  levels:  graduated  and  did  not  graduate.  Factor  2  has  three  levels:  no 
parent,  one  parent,  and  two  parents.  The  results  are  given  in  the  table  provided. 
Test,  at  the  1%  level  of  significance,  whether  there  is  sufficient  evidence  in  the 
data  to  conclude  that  family  structure  matters  in  school  performance  of  the 
students. 


                       Academic Status
Family           Graduated    Did Not Graduate
No parent               18                  31
One parent             101                  44
Two parents             70                  26

14.  A  large  middle  school  administrator  wishes  to  use  celebrity  influence  to 
encourage  students  to  make  healthier  choices  in  the  school  cafeteria.  The 
cafeteria is situated at the center of an open space. Every day at lunchtime
students get their lunch and a drink in three separate lines leading to three
separate serving stations. As an experiment, the school administrator
displayed a poster of a popular teen pop star drinking milk at each of the three
areas where drinks are provided, except the milk in the poster is different at
each location: one shows white milk, one shows strawberry-flavored pink milk,
and one shows chocolate milk. After the first day of the experiment the
administrator  noted  the  students’  milk  choices  separately  for  the  three  lines. 
The  data  are  given  in  the  table  provided.  Test,  at  the  1%  level  of  significance, 
whether  there  is  sufficient  evidence  in  the  data  to  conclude  that  the  posters 
had  some  impact  on  the  students’  drink  choices. 


                           Student Choice
Poster Choice      Regular    Strawberry    Chocolate
Regular                 38            28           40
Strawberry              18            51           24
Chocolate               32            32           53

LARGE  DATA  SET  EXERCISE 


15.  Large  Data  Set  8  records  the  result  of  a  survey  of  300  randomly  selected  adults 
who  go  to  movie  theaters  regularly.  For  each  person  the  gender  and  preferred 
type  of  movie  were  recorded.  Test,  at  the  5%  level  of  significance,  whether 
there  is  sufficient  evidence  in  the  data  to  conclude  that  the  factors  “gender” 
and  “preferred  type  of  movie”  are  dependent. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data8.xls

ANSWERS


1. a. 15.09,
   b. 24.72,
   c. 44.31

3. a. 10.64,
   b. 18.55,
   c. 40.26

5. a. 14.07,
   b. 16.01

7. a. $C_1 = 35$, $C_2 = 15$, $R_1 = 30$, $R_2 = 20$, $n = 50$,
   b. $E_{11} = 21$, $E_{12} = 9$, $E_{21} = 14$, $E_{22} = 6$,
   c. $\chi^2 = 0.3968$,
   d. $df = 1$

9. $\chi^2 = 0.6698$, $\chi^2_{0.05} = 5.99$, do not reject $H_0$

11. $\chi^2 = 72.35$, $\chi^2_{0.01} = 9.21$, reject $H_0$

13. $\chi^2 = 21.2784$, $\chi^2_{0.01} = 9.21$, reject $H_0$

15. $\chi^2 = 28.4539$. $df = 3$. Rejection Region: $[7.815, \infty)$. Decision: Reject
$H_0$ of independence.

11.2 Chi-Square One-Sample Goodness-of-Fit Tests


LEARNING  OBJECTIVE 

1.  To  understand  how  to  use  a  chi-square  test  to  judge  whether  a  sample 
fits  a  particular  population  well. 


Suppose we wish to determine if an ordinary-looking six-sided die is fair, or
balanced, meaning that every face has probability 1/6 of landing on top when the
die is tossed. We could toss the die dozens, maybe hundreds, of times and compare
the actual number of times each face landed on top to the expected number, which
would be 1/6 of the total number of tosses. We wouldn’t expect each number to be
exactly 1/6 of the total, but it should be close. To be specific, suppose the die is
tossed n = 60 times with the results summarized in Table 11.8 "Die Contingency
Table". For ease of reference we add a column of expected frequencies, which in
this simple example is simply a column of 10s. The result is shown as Table 11.9
"Updated Die Contingency Table". In analogy with the previous section we call this
an “updated” table. A measure of how much the data deviate from what we would
expect to see if the die really were fair is the sum of the squares of the differences
between the observed frequency $O$ and the expected frequency $E$ in each row, or,
standardizing by dividing each square by the expected number, the sum
$\sum (O - E)^2 / E$. If we formulate the investigation as a test of hypotheses, the test is

$H_0$: The die is fair
vs. $H_a$: The die is not fair

Table 11.8 Die Contingency Table

Die Value    Assumed Distribution    Observed Frequency
1            1/6                      9
2            1/6                     15
3            1/6                      9
4            1/6                      8
5            1/6                      6
6            1/6                     13

Table 11.9 Updated Die Contingency Table

Die Value    Assumed Distribution    Observed Freq.    Expected Freq.
1            1/6                      9                10
2            1/6                     15                10
3            1/6                      9                10
4            1/6                      8                10
5            1/6                      6                10
6            1/6                     13                10

We would reject the null hypothesis that the die is fair only if the number
$\sum (O - E)^2 / E$ is large, so the test is right-tailed. In this example the random
variable $\sum (O - E)^2 / E$ has the chi-square distribution with five degrees of
freedom. If we had decided at the outset to test at the 10% level of significance, the
critical value defining the rejection region would be, reading from Figure 12.4
"Critical Values of Chi-Square Distributions", $\chi^2_{\alpha} = \chi^2_{0.10} = 9.236$, so that the
rejection region would be the interval $[9.236, \infty)$. When we compute the value of
the standardized test statistic using the numbers in the last two columns of Table
11.9 "Updated Die Contingency Table", we obtain


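$$
\begin{aligned}
\sum \frac{(O - E)^2}{E}
&= \frac{(-1)^2}{10} + \frac{5^2}{10} + \frac{(-1)^2}{10} + \frac{(-2)^2}{10} + \frac{(-4)^2}{10} + \frac{3^2}{10} \\
&= 0.1 + 2.5 + 0.1 + 0.4 + 1.6 + 0.9 \\
&= 5.6
\end{aligned}
$$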

Since  5.6  <  9.236  the  decision  is  not  to  reject  Ho.  See  Figure  11.5  "Balanced  Die".  The 
data  do  not  provide  sufficient  evidence,  at  the  10%  level  of  significance,  to  conclude 
that  the  die  is  loaded. 
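
The arithmetic for the die example can be checked in a few lines. The following Python sketch is illustrative only: the variable names are our own, and the critical value 9.236 is copied from the text rather than computed.

```python
# Observed face counts from Table 11.9 (n = 60 tosses).
observed = [9, 15, 9, 8, 6, 13]
n = sum(observed)

# Under H0 (fair die) every face has probability 1/6, so E = n/6 for each face.
expected = [n / 6] * len(observed)

# Standardized sum of squared deviations: sum of (O - E)^2 / E.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value for df = 5 at the 10% level, copied from the text.
critical_value = 9.236
reject_h0 = chi_square >= critical_value

print(f"chi-square = {chi_square:.1f}, reject H0: {reject_h0}")
```

Because the statistic falls below the critical value, the sketch reports that the null hypothesis is not rejected, in agreement with the decision above.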
Figure 11.5 Balanced Die


In the general situation we consider a discrete random variable $X$ that can take $I$
different values, $x_1, x_2, \ldots, x_I$, for which the default assumption is that the
probability distribution is

x         $x_1$    $x_2$    ...    $x_I$
P(x)      $p_1$    $p_2$    ...    $p_I$

We  wish  to  test  the  hypotheses 

$H_0$: The assumed probability distribution for X is valid
vs. $H_a$: The assumed probability distribution for X is not valid

We take a sample of size n and obtain a list of observed frequencies. This is shown in
Table 11.10 "General Contingency Table". Based on the assumed probability
distribution we also have a list of expected frequencies, each of which is defined and
computed by the formula

$$E_i = n \times p_i$$

Table 11.10 General Contingency Table

Factor Levels    Assumed Distribution    Observed Frequency
1                $p_1$                   $O_1$
2                $p_2$                   $O_2$
⋮                ⋮                       ⋮
I                $p_I$                   $O_I$

Table  11.10  "General  Contingency  Table"  is  updated  to  Table  11.11  "Updated 
General  Contingency  Table"  by  adding  the  expected  frequency  for  each  value  of  X. 
To  simplify  the  notation  we  drop  indices  for  the  observed  and  expected  frequencies 
and  represent  Table  11.11  "Updated  General  Contingency  Table"  by  Table  11.12 
"Simplified  Updated  General  Contingency  Table". 


Table 11.11 Updated General Contingency Table

Factor Levels    Assumed Distribution    Observed Freq.    Expected Freq.
1                $p_1$                   $O_1$             $E_1$
2                $p_2$                   $O_2$             $E_2$
⋮                ⋮                       ⋮                 ⋮
I                $p_I$                   $O_I$             $E_I$

Table 11.12 Simplified Updated General Contingency Table

Factor Levels    Assumed Distribution    Observed Freq.    Expected Freq.
1                $p_1$                   O                 E
2                $p_2$                   O                 E
⋮                ⋮                       ⋮                 ⋮
I                $p_I$                   O                 E

Here is the test statistic for the general hypothesis based on Table 11.12 "Simplified
Updated General Contingency Table", together with the conditions that it follow a
chi-square distribution.

Test Statistic for Testing Goodness of Fit to a Discrete
Probability Distribution

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where the sum is over all the rows of the table (one for each value of X).

If

1. the true probability distribution of X is as assumed, and

2. the observed count $O$ of each cell in Table 11.12 "Simplified
Updated General Contingency Table" is at least 5,

then $\chi^2$ approximately follows a chi-square distribution with $df = I - 1$
degrees of freedom.


The test is known as a goodness-of-fit $\chi^2$ test since it tests the null hypothesis that
the sample fits the assumed probability distribution well. It is always right-tailed,
since deviation from the assumed probability distribution corresponds to large
values of $\chi^2$.


Testing  is  done  using  either  of  the  usual  five-step  procedures. 
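
As a minimal sketch of the computation behind the test, the function below evaluates $E_i = n \times p_i$, the statistic $\chi^2$, and $df = I - 1$ for any list of observed counts and assumed probabilities. The function name `goodness_of_fit` and its argument names are our own choices for illustration and do not come from the text; the critical value must still be read from the table of chi-square critical values.

```python
def goodness_of_fit(observed, probabilities):
    """Return (chi_square, df) for a goodness-of-fit test.

    observed      -- observed counts O, one per value of X
    probabilities -- assumed probabilities p_i, in the same order
    """
    if len(observed) != len(probabilities):
        raise ValueError("need one assumed probability per observed count")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("assumed probabilities must sum to 1")

    n = sum(observed)
    # Expected frequency for each value of X: E_i = n * p_i.
    expected = [n * p for p in probabilities]
    # Test statistic: sum over all rows of (O - E)^2 / E.
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # df = I - 1
    return chi_square, df


# Usage with the die data of Table 11.9.
stat, df = goodness_of_fit([9, 15, 9, 8, 6, 13], [1 / 6] * 6)
print(round(stat, 4), df)
```

The function deliberately stops at the statistic and the degrees of freedom, so that either five-step procedure (critical value or p-value) can be completed from the appropriate table.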
EXAMPLE 2


Table  11.13  "Ethnic  Groups  in  the  Census  Year"  shows  the  distribution  of 
various  ethnic  groups  in  the  population  of  a  particular  state  based  on  a 
decennial  U.S.  census.  Five  years  later  a  random  sample  of  2,500  residents  of 
the  state  was  taken,  with  the  results  given  in  Table  11.14  "Sample  Data  Five 
Years  After  the  Census  Year"  (along  with  the  probability  distribution  from 
the  census  year).  Test,  at  the  1%  level  of  significance,  whether  there  is 
sufficient  evidence  in  the  sample  to  conclude  that  the  distribution  of  ethnic 
groups  in  this  state  five  years  after  the  census  had  changed  from  that  in  the 
census  year. 


TABLE 11.13 ETHNIC GROUPS IN THE CENSUS YEAR

Ethnicity     White    Black    Amer.-Indian    Hispanic    Asian    Others
Proportion    0.743    0.216    0.012           0.012       0.008    0.009

TABLE 11.14 SAMPLE DATA FIVE YEARS AFTER THE CENSUS YEAR

Ethnicity          Assumed Distribution    Observed Frequency
White              0.743                   1732
Black              0.216                    538
American-Indian    0.012                     32
Hispanic           0.012                     42
Asian              0.008                    133
Others             0.009                     23

Solution: 

We  test  using  the  critical  value  approach. 


•  Step 1. The hypotheses of interest in this case can be expressed
as

$H_0$: The distribution of ethnic groups has not changed
vs. $H_a$: The distribution of ethnic groups has changed

•  Step  2.  The  distribution  is  chi-square. 

•  Step 3. To compute the value of the test statistic we must first
compute the expected number for each row of Table 11.14
"Sample Data Five Years After the Census Year". Since n = 2500,
using the formula $E_i = n \times p_i$ and the values of $p_i$ from either
Table 11.13 "Ethnic Groups in the Census Year" or Table 11.14
"Sample Data Five Years After the Census Year",

$E_1 = 2500 \times 0.743 = 1857.5$
$E_2 = 2500 \times 0.216 = 540$
$E_3 = 2500 \times 0.012 = 30$
$E_4 = 2500 \times 0.012 = 30$
$E_5 = 2500 \times 0.008 = 20$
$E_6 = 2500 \times 0.009 = 22.5$

Table 11.14 "Sample Data Five Years After the Census Year" is
updated to Table 11.15 "Observed and Expected Frequencies Five
Years After the Census Year".


TABLE 11.15 OBSERVED AND EXPECTED FREQUENCIES FIVE YEARS AFTER THE CENSUS YEAR

Ethnicity          Assumed Dist.    Observed Freq.    Expected Freq.
White              0.743            1732              1857.5
Black              0.216             538               540
American-Indian    0.012              32                30
Hispanic           0.012              42                30
Asian              0.008             133                20
Others             0.009              23                22.5

The value of the test statistic is

$$
\begin{aligned}
\chi^2 = \sum \frac{(O - E)^2}{E}
&= \frac{(1732 - 1857.5)^2}{1857.5} + \frac{(538 - 540)^2}{540} + \frac{(32 - 30)^2}{30} \\
&\quad + \frac{(42 - 30)^2}{30} + \frac{(133 - 20)^2}{20} + \frac{(23 - 22.5)^2}{22.5} \\
&= 651.881
\end{aligned}
$$


Since the random variable takes six values, $I = 6$. Thus the test
statistic follows the chi-square distribution with
$df = 6 - 1 = 5$ degrees of freedom.


Since the test is right-tailed, the critical value is $\chi^2_{0.01}$. Reading
from Figure 12.4 "Critical Values of Chi-Square Distributions",
$\chi^2_{0.01} = 15.086$, so the rejection region is $[15.086, \infty)$.

Since 651.881 > 15.086 the decision is to reject the null hypothesis. See
Figure 11.6. The data provide sufficient evidence, at the 1% level of
significance, to conclude that the ethnic distribution in this state has
changed in the five years since the U.S. census.
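
The computation in this example can be reproduced with a short Python sketch. It is illustrative only: the variable names are our own, and the critical value 15.086 is copied from the text rather than computed.

```python
ethnicities = ["White", "Black", "American-Indian", "Hispanic", "Asian", "Others"]
assumed = [0.743, 0.216, 0.012, 0.012, 0.008, 0.009]   # census-year proportions
observed = [1732, 538, 32, 42, 133, 23]                # sample five years later

n = sum(observed)
expected = [n * p for p in assumed]                    # E_i = n * p_i
# Contribution of each row to the statistic: (O - E)^2 / E.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

for name, o, e, c in zip(ethnicities, observed, expected, contributions):
    print(f"{name:16s} O = {o:5d}  E = {e:7.1f}  (O - E)^2 / E = {c:8.3f}")

critical_value = 15.086  # chi-square critical value, df = 5, 1% level (from the text)
print(f"chi-square = {chi_square:.3f}; reject H0: {chi_square >= critical_value}")
```

Printing the row-by-row contributions shows that almost all of the statistic comes from the Asian row, where the observed count of 133 is far above the expected count of 20.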


Figure  11.6 

Note  11.15  "Example  2" 


KEY TAKEAWAY


•  The chi-square goodness-of-fit test⁶ can be used to evaluate the
hypothesis  that  a  sample  is  taken  from  a  population  with  an  assumed 
specific  probability  distribution. 


6.  A  test  based  on  a  chi-square 
statistic  to  check  whether  a 
sample  is  taken  from  a 
population  with  a  hypothesized 
probability  distribution. 




1. A data sample is sorted into four categories with an assumed probability
distribution.

Factor Levels    Assumed Distribution    Observed Frequency
1                $p_1 = 0.1$             10
2                $p_2 = 0.4$             35
3                $p_3 = 0.4$             45
4                $p_4 = 0.1$             10

a. Find the size n of the sample.

b. Find the expected number E of observations for each level, if the sampled
population has a probability distribution as assumed (that is, just use the
formula $E_i = n \times p_i$).

c. Find the chi-square test statistic $\chi^2$.

d. Find the number of degrees of freedom of the chi-square test statistic.

2. A data sample is sorted into five categories with an assumed probability
distribution.

Factor Levels    Assumed Distribution    Observed Frequency
1                $p_1 = 0.3$             23
2                $p_2 = 0.3$             30
3                $p_3 = 0.2$             19
4                $p_4 = 0.1$              8
5                $p_5 = 0.1$             10

a. Find the size n of the sample.

b. Find the expected number E of observations for each level, if the sampled
population has a probability distribution as assumed (that is, just use the
formula $E_i = n \times p_i$).

c. Find the chi-square test statistic $\chi^2$.

d. Find the number of degrees of freedom of the chi-square test statistic.

APPLICATIONS


3.  Retailers  of  collectible  postage  stamps  often  buy  their  stamps  in  large 
quantities  by  weight  at  auctions.  The  prices  the  retailers  are  willing  to  pay 
depend  on  how  old  the  postage  stamps  are.  Many  collectible  postage  stamps  at 
auctions  are  described  by  the  proportions  of  stamps  issued  at  various  periods 
in  the  past.  Generally  the  older  the  stamps  the  higher  the  value.  At  one 
particular  auction,  a  lot  of  collectible  stamps  is  advertised  to  have  the  age 
distribution  given  in  the  table  provided.  A  retail  buyer  took  a  sample  of  73 
stamps  from  the  lot  and  sorted  them  by  age.  The  results  are  given  in  the  table 
provided.  Test,  at  the  5%  level  of  significance,  whether  there  is  sufficient 
evidence  in  the  data  to  conclude  that  the  age  distribution  of  the  lot  is  different 
from  what  was  claimed  by  the  seller. 


Year            Claimed Distribution    Observed Frequency
Before 1940     0.10                     6
1940 to 1959    0.25                    15
1960 to 1979    0.45                    30
After 1979      0.20                    22

4.  The  litter  size  of  Bengal  tigers  is  typically  two  or  three  cubs,  but  it  can  vary 
between  one  and  four.  Based  on  long-term  observations,  the  litter  size  of 
Bengal  tigers  in  the  wild  has  the  distribution  given  in  the  table  provided.  A 
zoologist  believes  that  Bengal  tigers  in  captivity  tend  to  have  different 
(possibly  smaller)  litter  sizes  from  those  in  the  wild.  To  verify  this  belief,  the 
zoologist  searched  all  data  sources  and  found  316  litter  size  records  of  Bengal 
tigers  in  captivity.  The  results  are  given  in  the  table  provided.  Test,  at  the  5% 
level  of  significance,  whether  there  is  sufficient  evidence  in  the  data  to 
conclude  that  the  distribution  of  litter  sizes  in  captivity  differs  from  that  in  the 
wild. 


Litter Size    Wild Litter Distribution    Observed Frequency
1              0.11                         41
2              0.69                        243
3              0.18                         27
4              0.02                          5

5. An online shoe retailer sells men’s shoes in sizes 8 to 13. In the past, orders for
the different shoe sizes have followed the distribution given in the table
provided. The management believes that recent marketing efforts may have
expanded  their  customer  base  and,  as  a  result,  there  may  be  a  shift  in  the  size 
distribution  for  future  orders.  To  have  a  better  understanding  of  its  future 
sales,  the  shoe  seller  examined  1,040  sales  records  of  recent  orders  and  noted 
the  sizes  of  the  shoes  ordered.  The  results  are  given  in  the  table  provided.  Test, 
at  the  1%  level  of  significance,  whether  there  is  sufficient  evidence  in  the  data 
to  conclude  that  the  shoe  size  distribution  of  future  sales  will  differ  from  the 
historic  one. 


Shoe Size    Past Size Distribution    Recent Size Frequency
8.0          0.03                       25
8.5          0.06                       43
9.0          0.09                       88
9.5          0.19                      221
10.0         0.23                      272
10.5         0.14                      150
11.0         0.10                      107
11.5         0.06                       51
12.0         0.05                       37
12.5         0.03                       35
13.0         0.02                       11

6. An online shoe retailer sells women’s shoes in sizes 5 to 10. In the past, orders
for  the  different  shoe  sizes  have  followed  the  distribution  given  in  the  table 
provided.  The  management  believes  that  recent  marketing  efforts  may  have 
expanded  their  customer  base  and,  as  a  result,  there  may  be  a  shift  in  the  size 
distribution  for  future  orders.  To  have  a  better  understanding  of  its  future 
sales,  the  shoe  seller  examined  1,174  sales  records  of  recent  orders  and  noted 
the  sizes  of  the  shoes  ordered.  The  results  are  given  in  the  table  provided.  Test, 
at  the  1%  level  of  significance,  whether  there  is  sufficient  evidence  in  the  data 
to  conclude  that  the  shoe  size  distribution  of  future  sales  will  differ  from  the 
historic  one. 


Shoe Size    Past Size Distribution    Recent Size Frequency
5.0          0.02                       20
5.5          0.03                       23
6.0          0.07                       88
6.5          0.08                       90
7.0          0.20                      222
7.5          0.20                      258
8.0          0.15                      177
8.5          0.11                      121
9.0          0.08                       91
9.5          0.04                       53
10.0         0.02                       31

7.  A  chess  opening  is  a  sequence  of  moves  at  the  beginning  of  a  chess  game.  There 
are  many  well-studied  named  openings  in  chess  literature.  French  Defense  is 
one  of  the  most  popular  openings  for  black,  although  it  is  considered  a 
relatively  weak  opening  since  it  gives  black  probability  0.344  of  winning, 
probability  0.405  of  losing,  and  probability  0.251  of  drawing.  A  chess  master 
believes  that  he  has  discovered  a  new  variation  of  French  Defense  that  may 
alter  the  probability  distribution  of  the  outcome  of  the  game.  In  his  many 
Internet  chess  games  in  the  last  two  years,  he  was  able  to  apply  the  new 
variation  in  77  games.  The  wins,  losses,  and  draws  in  the  77  games  are  given  in 
the  table  provided.  Test,  at  the  5%  level  of  significance,  whether  there  is 
sufficient  evidence  in  the  data  to  conclude  that  the  newly  discovered  variation 
of  French  Defense  alters  the  probability  distribution  of  the  result  of  the  game. 


Result for Black    Probability Distribution    New Variation Wins
Win                 0.344                       31
Loss                0.405                       25
Draw                0.251                       21

8.  The  Department  of  Parks  and  Wildlife  stocks  a  large  lake  with  fish  every  six 
years.  It  is  determined  that  a  healthy  diversity  of  fish  in  the  lake  should  consist 
of  10%  largemouth  bass,  15%  smallmouth  bass,  10%  striped  bass,  10%  trout,  and 
20%  catfish.  Therefore  each  time  the  lake  is  stocked,  the  fish  population  in  the 
lake  is  restored  to  maintain  that  particular  distribution.  Every  three  years,  the 
department  conducts  a  study  to  see  whether  the  distribution  of  the  fish  in  the 
lake has shifted away from the target proportions. In one particular year, a
research group from the department observed a sample of 292 fish from the
lake  with  the  results  given  in  the  table  provided.  Test,  at  the  5%  level  of 
significance,  whether  there  is  sufficient  evidence  in  the  data  to  conclude  that 
the  fish  population  distribution  has  shifted  since  the  last  stocking. 


Fish               Target Distribution    Fish in Sample
Largemouth Bass    0.10                    14
Smallmouth Bass    0.15                    49
Striped Bass       0.10                    21
Trout              0.10                    22
Catfish            0.20                    75
Other              0.35                   111

LARGE  DATA  SET  EXERCISE 


9. Large Data Set 4 records the result of 500 tosses of a six-sided die. Test, at the
10% level of significance, whether there is sufficient evidence in the data to
conclude that the die is not “fair” (or “balanced”), that is, that the probability
distribution differs from probability 1/6 for each of the six faces on the die.

http://www.gone.2012books.lardbucket.org/sites/all/files/data4.xls


ANSWERS 


1. a. $n = 100$,
   b. $E = 10$, $E = 40$, $E = 40$, $E = 10$;
   c. $\chi^2 = 1.25$,
   d. $df = 3$

3. $\chi^2 = 4.8082$, $\chi^2_{0.05} = 7.81$, do not reject $H_0$

5. $\chi^2 = 26.5765$, $\chi^2_{0.01} = 23.21$, reject $H_0$

7. $\chi^2 = 2.1401$, $\chi^2_{0.05} = 5.99$, do not reject $H_0$

9. $\chi^2 = 2.944$. $df = 5$. Rejection Region: $[9.236, \infty)$. Decision: Fail to
reject $H_0$ of balance.

11.3 F-tests for Equality of Two Variances


LEARNING  OBJECTIVES 

1.  To  understand  what  F-distributions  are. 

2.  To  understand  how  to  use  an  F-test  to  judge  whether  two  population 
variances  are  equal. 


F-Distributions 

Another important and useful family of distributions in statistics is the family of
F-distributions. Each member of the F-distribution family is specified by a pair of
parameters called degrees of freedom and denoted $df_1$ and $df_2$. Figure 11.7
"Many F-Distributions" shows several F-distributions for different pairs of degrees
of freedom. An F random variable⁷ is a random variable that assumes only positive
values and follows an F-distribution.


Figure  11.7  Many  F-Distributions 


7. A random variable following an
F-distribution.

The parameter $df_1$ is often referred to as the numerator degrees of freedom and the
parameter $df_2$ as the denominator degrees of freedom. It is important to keep in
mind that they are not interchangeable. For example, the F-distribution with
degrees of freedom $df_1 = 3$ and $df_2 = 8$ is a different distribution from the
F-distribution with degrees of freedom $df_1 = 8$ and $df_2 = 3$.


Definition 

The value of the F random variable F with degrees of freedom $df_1$ and $df_2$ that cuts off
a right tail of area c is denoted $F_c$ and is called a critical value. See Figure 11.8.


Figure 11.8 $F_c$ Illustrated


Tables containing the values of $F_c$ are given in Chapter 11 "Chi-Square Tests and F-Tests".
Each of the tables is for a fixed collection of values of c, either 0.900, 0.950, 0.975,
0.990, and 0.995 (yielding what are called “lower” critical values), or 0.005, 0.010,
0.025, 0.050, and 0.100 (yielding what are called “upper” critical values). In each
table critical values are given for various pairs $(df_1, df_2)$. We illustrate the use of
the tables with several examples.

EXAMPLE 3


Suppose F is an F random variable with degrees of freedom $df_1 = 5$ and
$df_2 = 4$. Use the tables to find

a. $F_{0.10}$

b. $F_{0.95}$

Solution:


a. The column headings of all the tables contain $df_1 = 5$. Look for
the table for which 0.10 is one of the entries on the extreme left
(a table of upper critical values) and that has a row heading
$df_2 = 4$ in the left margin of the table. A portion of the relevant
table is provided. The entry in the intersection of the column
with heading $df_1 = 5$ and the row with the headings 0.10 and
$df_2 = 4$, which is shaded in the table provided, is the answer,
$F_{0.10} = 4.05$.

                              $df_1$
F Tail Area    $df_2$     1      2      5
0.005          4                        22.5
0.01           4                        15.5
0.025          4                        9.36
0.05           4                        6.26
0.10           4                        4.05

b. Look for the table for which 0.95 is one of the entries on the
extreme left (a table of lower critical values) and that has a row
heading $df_2 = 4$ in the left margin of the table. A portion of the
relevant table is provided. The entry in the intersection of the
column with heading $df_1 = 5$ and the row with the headings
0.95 and $df_2 = 4$, which is shaded in the table provided, is the
answer, $F_{0.95} = 0.19$.

                              $df_1$
F Tail Area    $df_2$     1      2      5
0.90           4                        0.28
0.95           4                        0.19
0.975          4                        0.14
0.99           4                        0.09
0.995          4                        0.06

EXAMPLE 4


Suppose F is an F random variable with degrees of freedom $df_1 = 2$ and
$df_2 = 20$. Let $\alpha = 0.05$. Use the tables to find

a. $F_{\alpha}$
b. $F_{\alpha/2}$
c. $F_{1-\alpha}$
d. $F_{1-\alpha/2}$

Solution:


a. The column headings of all the tables contain $df_1 = 2$. Look for
the table for which $\alpha = 0.05$ is one of the entries on the
extreme left (a table of upper critical values) and that has a row
heading $df_2 = 20$ in the left margin of the table. A portion of
the relevant table is provided. The shaded entry, in the
intersection of the column with heading $df_1 = 2$ and the row
with the headings 0.05 and $df_2 = 20$, is the answer,
$F_{0.05} = 3.49$.

                           $df_1$
F Tail Area    $df_2$     1      2
0.005          20                6.99
0.01           20                5.85
0.025          20                4.46
0.05           20                3.49
0.10           20                2.59

b. Look for the table for which $\alpha/2 = 0.025$ is one of the
entries on the extreme left (a table of upper critical values) and
that has a row heading $df_2 = 20$ in the left margin of the table.
A portion of the relevant table is provided. The shaded entry, in
the intersection of the column with heading $df_1 = 2$ and the
row with the headings 0.025 and $df_2 = 20$, is the answer,
$F_{0.025} = 4.46$.

                           $df_1$
F Tail Area    $df_2$     1      2
0.005          20                6.99
0.01           20                5.85
0.025          20                4.46
0.05           20                3.49
0.10           20                2.59

c. Look for the table for which $1 - \alpha = 0.95$ is one of the entries
on the extreme left (a table of lower critical values) and that has
a row heading $df_2 = 20$ in the left margin of the table. A
portion of the relevant table is provided. The shaded entry, in
the intersection of the column with heading $df_1 = 2$ and the
row with the headings 0.95 and $df_2 = 20$, is the answer,
$F_{0.95} = 0.05$.

                           $df_1$
F Tail Area    $df_2$     1      2
0.90           20                0.11
0.95           20                0.05
0.975          20                0.03
0.99           20                0.01
0.995          20                0.01

d. Look for the table for which $1 - \alpha/2 = 0.975$ is one of the
entries on the extreme left (a table of lower critical values) and
that has a row heading $df_2 = 20$ in the left margin of the table.
A portion of the relevant table is provided. The shaded entry, in
the intersection of the column with heading $df_1 = 2$ and the
row with the headings 0.975 and $df_2 = 20$, is the answer,
$F_{0.975} = 0.03$.

                           $df_1$
F Tail Area    $df_2$     1      2
0.90           20                0.11
0.95           20                0.05
0.975          20                0.03
0.99           20                0.01
0.995          20                0.01

A  fact  that  sometimes  allows  us  to  find  a  critical  value  from  a  table  that  we  could 
not  read  otherwise  is: 


If $F_u(r, s)$ denotes the value of the F-distribution with degrees of freedom
$df_1 = r$ and $df_2 = s$ that cuts off a right tail of area $u$, then

$$F_c(k, \ell) = \frac{1}{F_{1-c}(\ell, k)}$$

EXAMPLE 5


Use  the  tables  to  find 

a. $F_{0.01}$ for an F random variable with $df_1 = 13$ and $df_2 = 8$

b. $F_{0.975}$ for an F random variable with $df_1 = 40$ and $df_2 = 10$

Solution:


a. There is no table with $df_1 = 13$, but there is one with
$df_1 = 8$. Thus we use the fact that

$$F_{0.01}(13, 8) = \frac{1}{F_{0.99}(8, 13)}$$

Using the relevant table we find that $F_{0.99}(8, 13) = 0.18$,
hence $F_{0.01}(13, 8) = 0.18^{-1} = 5.556$.


b. There is no table with $df_1 = 40$, but there is one with
$df_1 = 10$. Thus we use the fact that

$$F_{0.975}(40, 10) = \frac{1}{F_{0.025}(10, 40)}$$

Using the relevant table we find that $F_{0.025}(10, 40) = 3.31$,
hence $F_{0.975}(40, 10) = 3.31^{-1} = 0.302$.
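The reciprocal rule used in Example 5 can be sketched in a few lines of Python. This is only an illustration: the function name `swap_tail` is hypothetical, and the two inputs are the table values quoted in the example, not values computed by the code.

```python
def swap_tail(table_value):
    """Return F_c(k, l) given the tabulated value F_{1-c}(l, k).

    The caller is responsible for looking up the table entry with the
    tail area replaced by 1 - c and the degrees of freedom swapped.
    """
    if table_value <= 0:
        raise ValueError("F critical values are positive")
    return 1.0 / table_value


# Example 5(a): F_0.99(8, 13) = 0.18 is read from the table.
f_a = swap_tail(0.18)
# Example 5(b): F_0.025(10, 40) = 3.31 is read from the table.
f_b = swap_tail(3.31)

print(round(f_a, 3), round(f_b, 3))
```

Because the table entries are rounded to two decimal places, the reciprocals inherit that rounding error.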


F-Tests  for  Equality  of  Two  Variances 


8.  A  test  based  on  an  F  statistic  to 
check  whether  two  population 
variances  are  equal. 


In Chapter 9 "Two-Sample Problems" we saw how to test hypotheses about the
difference between two population means $\mu_1$ and $\mu_2$. In some practical situations
the difference between the population standard deviations $\sigma_1$ and $\sigma_2$ is also of
interest. Standard deviation measures the variability of a random variable. For
example, if the random variable measures the size of a machined part in a
manufacturing process, the size of the standard deviation is one indicator of product
quality. A smaller standard deviation among items produced in the manufacturing
process is desirable since it indicates consistency in product quality.

For theoretical reasons it is easier to compare the squares of the population
standard deviations, the population variances $\sigma_1^2$ and $\sigma_2^2$. This is not a problem,
since $\sigma_1 = \sigma_2$ precisely when $\sigma_1^2 = \sigma_2^2$, $\sigma_1 < \sigma_2$ precisely when $\sigma_1^2 < \sigma_2^2$, and
$\sigma_1 > \sigma_2$ precisely when $\sigma_1^2 > \sigma_2^2$.

The null hypothesis always has the form $H_0: \sigma_1^2 = \sigma_2^2$. The three forms of the
alternative hypothesis, with the terminology for each case, are:


Form of $H_a$                        Terminology
$H_a: \sigma_1^2 > \sigma_2^2$       Right-tailed
$H_a: \sigma_1^2 < \sigma_2^2$       Left-tailed
$H_a: \sigma_1^2 \neq \sigma_2^2$    Two-tailed

Just  as  when  we  test  hypotheses  concerning  two  population  means,  we  take  a 
random sample from each population, of sizes $n_1$ and $n_2$, and compute the sample
standard deviations $s_1$ and $s_2$. In this context the samples are always independent.
The  populations  themselves  must  be  normally  distributed. 


Test Statistic for Hypothesis Tests Concerning the
Difference Between Two Population Variances

$$F = \frac{s_1^2}{s_2^2}$$

If the two populations are normally distributed and if $H_0: \sigma_1^2 = \sigma_2^2$ is true
then under independent sampling F approximately follows an F-distribution
with degrees of freedom $df_1 = n_1 - 1$ and $df_2 = n_2 - 1$.


A  test  based  on  the  test  statistic  F  is  called  an  F-test. 


A most important point is that while the rejection region for a right-tailed test is
exactly as in every other situation that we have encountered, because of the
asymmetry in the F-distribution the critical value for a left-tailed test and the lower
critical value for a two-tailed test have the special forms shown in the following
table:


Terminology     Alternative Hypothesis               Rejection Region
Right-tailed    $H_a: \sigma_1^2 > \sigma_2^2$       $F > F_{\alpha}$
Left-tailed     $H_a: \sigma_1^2 < \sigma_2^2$       $F < F_{1-\alpha}$
Two-tailed      $H_a: \sigma_1^2 \neq \sigma_2^2$    $F < F_{1-\alpha/2}$ or $F > F_{\alpha/2}$

Figure 11.9 "Rejection Regions: (a) Right-Tailed; (b) Left-Tailed; (c) Two-Tailed"
illustrates these rejection regions.


Figure 11.9 Rejection Regions: (a) Right-Tailed; (b) Left-Tailed; (c) Two-Tailed


The  test  is  performed  using  the  usual  five-step  procedure  described  at  the  end  of 
Section  8.1  "The  Elements  of  Hypothesis  Testing"  in  Chapter  8  "Testing 
Hypotheses". 
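The three rejection rules can be collected into one small decision function. The sketch below is illustrative only: the name `f_test_rejects` and its parameters are hypothetical, the critical values must still be read from the F tables, and the boundary points are treated as part of the rejection region (for a continuous distribution the boundary has probability zero, so this choice does not affect the test).

```python
def f_test_rejects(f_stat, tail, lower=None, upper=None):
    """Return True when the F statistic falls in the rejection region.

    tail  -- "right", "left", or "two"
    upper -- F_alpha (right-tailed) or F_{alpha/2} (two-tailed)
    lower -- F_{1-alpha} (left-tailed) or F_{1-alpha/2} (two-tailed)
    """
    if tail == "right":
        if upper is None:
            raise ValueError("a right-tailed test needs the upper critical value")
        return f_stat >= upper
    if tail == "left":
        if lower is None:
            raise ValueError("a left-tailed test needs the lower critical value")
        return f_stat <= lower
    if tail == "two":
        if lower is None or upper is None:
            raise ValueError("a two-tailed test needs both critical values")
        return f_stat <= lower or f_stat >= upper
    raise ValueError(f"unknown tail: {tail!r}")


# Made-up numbers, purely to exercise each branch.
print(f_test_rejects(2.5, "right", upper=2.2))
print(f_test_rejects(0.3, "left", lower=0.43))
print(f_test_rejects(1.0, "two", lower=0.43, upper=2.2))
```

Keeping the lower and upper critical values as separate arguments mirrors the asymmetry of the F-distribution: the lower value is not the negative of the upper one.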
EXAMPLE 6


One  of  the  quality  measures  of  blood  glucose  meter  strips  is  the  consistency 
of  the  test  results  on  the  same  sample  of  blood.  The  consistency  is  measured 
by  the  variance  of  the  readings  in  repeated  testing.  Suppose  two  types  of 
strips,  A  and  B,  are  compared  for  their  respective  consistencies.  We 
arbitrarily  label  the  population  of  Type  A  strips  Population  1  and  the 
population of Type B strips Population 2. Suppose 16 Type A strips were
tested with blood drops from a well-shaken vial and 21 Type B strips were
tested with the blood from the same vial. The results are summarized in
Table 11.16 "Two Types of Test Strips". Assume the glucose readings using
Type A strips follow a normal distribution with variance $\sigma_1^2$ and those using
Type B strips follow a normal distribution with variance $\sigma_2^2$. Test, at
the 10% level of significance, whether the data provide sufficient evidence to
conclude that the consistencies of the two types of strips are different.


TABLE 11.16 TWO TYPES OF TEST STRIPS


Strip Type    Sample Size    Sample Variance
A             $n_1 = 16$     $s_1^2 = 2.09$
B             $n_2 = 21$     $s_2^2 = 1.10$

Solution: 


• Step 1. The test of hypotheses is

$H_0: \sigma_1^2 = \sigma_2^2$

vs. $H_a: \sigma_1^2 \neq \sigma_2^2$ @ $\alpha = 0.10$

• Step 2. The distribution is the F-distribution with degrees of freedom
$df_1 = 16 - 1 = 15$ and $df_2 = 21 - 1 = 20$.

• Step 3. The test is two-tailed. The left or lower critical value is
$F_{1-\alpha/2} = F_{0.95} = 0.43$. The right or upper critical value is
$F_{\alpha/2} = F_{0.05} = 2.20$. Thus the rejection region is
$[0, 0.43] \cup [2.20, \infty)$, as illustrated in Figure 11.10 "Rejection
Region and Test Statistic for Note 11.27 "Example 6"".

Figure 11.10 Rejection Region and Test Statistic for Note 11.27 "Example 6"

• Step 4. The value of the test statistic is

$$F = \frac{s_1^2}{s_2^2} = \frac{2.09}{1.10} = 1.90$$

• Step 5. As shown in Figure 11.10 "Rejection Region and Test Statistic for
Note 11.27 "Example 6"", the test statistic 1.90 does not lie in the rejection
region, so the decision is not to reject $H_0$. The data do not provide sufficient
evidence, at the 10% level of significance, to conclude that there is a difference in
the consistency, as measured by the variance, of the two types of test
strips.



EXAMPLE  7 


In  the  context  of  Note  11.27  "Example  6",  suppose  Type  A  test  strips  are  the 
current  market  leader  and  Type  B  test  strips  are  a  newly  improved  version 
of  Type  A.  Test,  at  the  10%  level  of  significance,  whether  the  data  given  in 
Table  11.16  "Two  Types  of  Test  Strips"  provide  sufficient  evidence  to 
conclude  that  Type  B  test  strips  have  better  consistency  (lower  variance) 
than  Type  A  test  strips. 

Solution: 


• Step 1. The test of hypotheses is now

$H_0: \sigma_1^2 = \sigma_2^2$

vs. $H_a: \sigma_1^2 > \sigma_2^2$ @ $\alpha = 0.10$

• Step 2. The distribution is the F-distribution with degrees of freedom
$df_1 = 16 - 1 = 15$ and $df_2 = 21 - 1 = 20$.

• Step 3. The value of the test statistic is

$$F = \frac{s_1^2}{s_2^2} = \frac{2.09}{1.10} = 1.90$$

• Step 4. The test is right-tailed. The single critical value is
$F_{\alpha} = F_{0.10} = 1.84$. Thus the rejection region is $[1.84, \infty)$, as
illustrated in Figure 11.11 "Rejection Region and Test Statistic for Note 11.28 "Example 7"".


Figure 11.11 Rejection Region and Test Statistic for Note 11.28 "Example 7"

• Step 5. As shown in Figure 11.11 "Rejection Region and Test Statistic for
Note 11.28 "Example 7"", the test statistic 1.90 lies in the rejection region, so
the decision is to reject $H_0$. The data provide sufficient evidence, at the 10% level of
significance, to conclude that Type B test strips have better consistency
(lower variance) than Type A test strips do.
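Examples 6 and 7 use the same data and the same test statistic but reach different decisions, because the same 10% of tail area is split between two tails in one case and placed entirely in the right tail in the other. The short Python sketch below reproduces both decisions; the variable names are illustrative, and the critical values 0.43, 2.20, and 1.84 are the table values quoted in the examples rather than values the code computes.

```python
# Summary data from Table 11.16.
n1, var1 = 16, 2.09   # Type A strips
n2, var2 = 21, 1.10   # Type B strips

f_stat = var1 / var2            # test statistic F = s1^2 / s2^2
df1, df2 = n1 - 1, n2 - 1       # degrees of freedom (15, 20)

# Example 6: two-tailed test at the 10% level, 5% in each tail.
lower_crit, upper_crit = 0.43, 2.20
reject_two_tailed = f_stat <= lower_crit or f_stat >= upper_crit

# Example 7: right-tailed test at the 10% level, all 10% in the right tail.
right_crit = 1.84
reject_right_tailed = f_stat >= right_crit

print(f"F = {f_stat:.2f} with df = ({df1}, {df2})")
print("two-tailed test rejects H0:  ", reject_two_tailed)
print("right-tailed test rejects H0:", reject_right_tailed)
```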


KEY  TAKEAWAYS 


• Critical values of an F-distribution with degrees of freedom $df_1$ and $df_2$
are found in tables in Chapter 12 "Appendix".

• An F-test can be used to evaluate the hypothesis of two identical normal
population variances.

EXERCISES


BASIC 


1. Find $F_{0.01}$ for each of the following degrees of freedom.

a. $df_1 = 5$ and $df_2 = 5$

b. $df_1 = 5$ and $df_2 = 12$

c. $df_1 = 12$ and $df_2 = 20$

2. Find $F_{0.05}$ for each of the following degrees of freedom.

a. $df_1 = 6$ and $df_2 = 6$

b. $df_1 = 6$ and $df_2 = 12$

c. $df_1 = 12$ and $df_2 = 30$

3. Find $F_{0.95}$ for each of the following degrees of freedom.

a. $df_1 = 6$ and $df_2 = 6$

b. $df_1 = 6$ and $df_2 = 12$

c. $df_1 = 12$ and $df_2 = 30$

4. Find $F_{0.90}$ for each of the following degrees of freedom.

a. $df_1 = 5$ and $df_2 = 5$

b. $df_1 = 5$ and $df_2 = 12$

c. $df_1 = 12$ and $df_2 = 20$

5. For $df_1 = 7$, $df_2 = 10$, and $\alpha = 0.05$, find

a. $F_{\alpha}$

b. $F_{1-\alpha}$

c. $F_{\alpha/2}$

d. $F_{1-\alpha/2}$

6. For $df_1 = 15$, $df_2 = 8$, and $\alpha = 0.01$, find

a. $F_{\alpha}$

b. $F_{1-\alpha}$

c. $F_{\alpha/2}$

d. $F_{1-\alpha/2}$
7. For each of the two samples

Sample 1: {8, 2, 11, 0, -2}
Sample 2: {-2, 0, 0, 0, 2, 4, -1}

find

a. the sample size,

b. the sample mean,

c. the sample variance.

8. For each of the two samples

Sample 1: {0.8, 1.2, 1.1, 0.8, -2.0}
Sample 2: {-2.0, 0.0, 0.7, 0.8, 2.2, 4.1, -1.9}

find


a.  the  sample  size, 

b.  the  sample  mean, 

c.  the  sample  variance. 

9.  Two  random  samples  taken  from  two  normal  populations  yielded  the  following 
information: 


Sample    Sample Size    Sample Variance
1         $n_1 = 16$     $s_1^2 = 53$
2         $n_2 = 21$     $s_2^2 = 32$

a. Find the statistic $F = s_1^2 / s_2^2$.

b. Find the degrees of freedom $df_1$ and $df_2$.

c. Find $F_{0.05}$ using $df_1$ and $df_2$ computed above.

d. Perform the test of hypotheses $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_a: \sigma_1^2 > \sigma_2^2$ at the
5% level of significance.

10.  Two  random  samples  taken  from  two  normal  populations  yielded  the  following 
information: 


Sample    Sample Size    Sample Variance
1         $n_1 = 11$     $s_1^2 = 61$
2         $n_2 = 8$      $s_2^2 = 44$

a. Find the statistic $F = s_1^2 / s_2^2$.

b. Find the degrees of freedom $df_1$ and $df_2$.

c. Find $F_{0.05}$ using $df_1$ and $df_2$ computed above.

d. Perform the test of hypotheses $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_a: \sigma_1^2 > \sigma_2^2$ at the
5% level of significance.

11.  Two  random  samples  taken  from  two  normal  populations  yielded  the  following 
information: 


Sample    Sample Size    Sample Variance
1         $n_1 = 10$     $s_1^2 = 12$
2         $n_2 = 13$     $s_2^2 = 23$

a. Find the statistic $F = s_1^2 / s_2^2$.

b. Find the degrees of freedom $df_1$ and $df_2$.

c. For $\alpha = 0.05$ find $F_{1-\alpha}$ using $df_1$ and $df_2$ computed above.

d. Perform the test of hypotheses $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_a: \sigma_1^2 < \sigma_2^2$ at the
5% level of significance.

12.  Two  random  samples  taken  from  two  normal  populations  yielded  the  following 
information: 


Sample    Sample Size    Sample Variance
1         $n_1 = 8$      $s_1^2 = 102$
2         $n_2 = 8$      $s_2^2 = 603$

a. Find the statistic $F = s_1^2 / s_2^2$.

b. Find the degrees of freedom $df_1$ and $df_2$.

c. For $\alpha = 0.05$ find $F_{1-\alpha}$ using $df_1$ and $df_2$ computed above.

d. Perform the test of hypotheses $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_a: \sigma_1^2 < \sigma_2^2$ at the
5% level of significance.

13.  Two  random  samples  taken  from  two  normal  populations  yielded  the  following 
information: 


Sample    Sample Size    Sample Variance
1         $n_1 = 9$      $s_1^2 = 123$
2         $n_2 = 31$     $s_2^2 = 543$

a. Find the statistic $F = s_1^2 / s_2^2$.

b. Find the degrees of freedom $df_1$ and $df_2$.

c. For $\alpha = 0.05$ find $F_{1-\alpha/2}$ and $F_{\alpha/2}$ using $df_1$ and $df_2$ computed
above.

d. Perform the test of hypotheses $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_a: \sigma_1^2 \neq \sigma_2^2$ at the
5% level of significance.

14.  Two  random  samples  taken  from  two  normal  populations  yielded  the  following 
information: 


Sample    Sample Size    Sample Variance
1         $n_1 = 21$     $s_1^2 = 199$
2         $n_2 = 21$     $s_2^2 = 66$

a. Find the statistic $F = s_1^2 / s_2^2$.

b. Find the degrees of freedom $df_1$ and $df_2$.

c. For $\alpha = 0.05$ find $F_{1-\alpha/2}$ and $F_{\alpha/2}$ using $df_1$ and $df_2$ computed
above.

d. Perform the test of hypotheses $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_a: \sigma_1^2 \neq \sigma_2^2$ at the
5% level of significance.


APPLICATIONS 


15.  Japanese  sturgeon  is  a  subspecies  of  the  sturgeon  family  indigenous  to  Japan 
and  the  Northwest  Pacific.  In  a  particular  fish  hatchery  newly  hatched  baby 
Japanese  sturgeon  are  kept  in  tanks  for  several  weeks  before  being  transferred 
to  larger  ponds.  Dissolved  oxygen  in  tank  water  is  very  tightly  monitored  by  an 
electronic  system  and  rigorously  maintained  at  a  target  level  of  6.5  milligrams 
per liter (mg/l). The fish hatchery looks to upgrade its water monitoring
systems  for  tighter  control  of  dissolved  oxygen.  A  new  system  is  evaluated 
against  the  old  one  currently  being  used  in  terms  of  the  variance  in  measured 
dissolved  oxygen.  Thirty-one  water  samples  from  a  tank  operated  with  the  new 
system  were  collected  and  16  water  samples  from  a  tank  operated  with  the  old 
system  were  collected,  all  during  the  course  of  a  day.  The  samples  yield  the 
following  information: 

New    Sample 1:    $n_1 = 31$    $s_1^2 = 0.0121$
Old    Sample 2:    $n_2 = 16$    $s_2^2 = 0.0319$


Test, at the 10% level of significance, whether the data provide sufficient
evidence to conclude that the new system will provide a tighter control of
dissolved oxygen in the tanks.

16. The risk of investing in a stock is measured by the volatility, or the variance, in
changes in the price of that stock. Mutual funds are baskets of stocks and offer
generally lower risk to investors. Different mutual funds have different focuses
and offer different levels of risk. Hippolyta is deciding between two mutual
funds, A and B, with similar expected returns. To make a final decision, she
examined the annual returns of the two funds during the last ten years and
obtained the following information:


Mutual Fund A    Sample 1:    $n_1 = 10$    $s_1^2 = 0.012$
Mutual Fund B    Sample 2:    $n_2 = 10$    $s_2^2 = 0.005$

Test, at the 5% level of significance, whether the data provide sufficient
evidence to conclude that the two mutual funds offer different levels of risk.

17.  It  is  commonly  acknowledged  that  grading  of  the  writing  part  of  a  college 
entrance  examination  is  subject  to  inconsistency.  Every  year  a  large  number  of 
potential  graders  are  put  through  a  rigorous  training  program  before  being 
given  grading  assignments.  In  order  to  gauge  whether  such  a  training  program 
really  enhances  consistency  in  grading,  a  statistician  conducted  an  experiment 
in  which  a  reference  essay  was  given  to  61  trained  graders  and  31  untrained 
graders.  Information  on  the  scores  given  by  these  graders  is  summarized 
below: 

Trained      Sample 1:    $n_1 = 61$    $s_1^2 = 2.15$
Untrained    Sample 2:    $n_2 = 31$    $s_2^2 = 3.91$

Test,  at  the  5%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  the  training  program  enhances  the  consistency  in 
essay  grading. 

18.  A  common  problem  encountered  by  many  classical  music  radio  stations  is  that 
their  listeners  belong  to  an  increasingly  narrow  band  of  ages  in  the  population. 
The  new  general  manager  of  a  classical  music  radio  station  believed  that  a  new 
playlist  offered  by  a  professional  programming  agency  would  attract  listeners 
from  a  wider  range  of  ages.  The  new  list  was  used  for  a  year.  Two  random 
samples  were  taken  before  and  after  the  new  playlist  was  adopted.  Information 
on the ages of the listeners in the samples is summarized below:

Before    Sample 1:    $n_1 = 21$    $s_1^2 = 56.25$
After     Sample 2:    $n_2 = 16$    $s_2^2 = 76.56$


Test,  at  the  10%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  the  new  playlist  has  expanded  the  range  of  listener 
ages. 

19.  A  laptop  computer  maker  uses  battery  packs  supplied  by  two  companies,  A  and 
B.  While  both  brands  have  the  same  average  battery  life  between  charges 
(LBC),  the  computer  maker  seems  to  receive  more  complaints  about  shorter 
LBC  than  expected  for  battery  packs  supplied  by  company  B.  The  computer 
maker  suspects  that  this  could  be  caused  by  higher  variance  in  LBC  for  Brand 
B.  To  check  that,  ten  new  battery  packs  from  each  brand  are  selected,  installed 
on  the  same  models  of  laptops,  and  the  laptops  are  allowed  to  run  until  the 
battery  packs  are  completely  discharged.  The  following  are  the  observed  LBCs 
in  hours. 


Brand A    Brand B
3.2        3.0
3.4        3.5
2.8        2.9
3.0        3.1
3.0        2.3
3.0        2.0
2.8        3.0
2.9        2.9
3.0        3.0
3.0        4.1

Test, at the 5% level of significance, whether the data provide sufficient
evidence to conclude that the LBCs of Brand B have a larger variance than those
of Brand A.

20.  A  manufacturer  of  a  blood-pressure  measuring  device  for  home  use  claims  that 
its  device  is  more  consistent  than  that  produced  by  a  leading  competitor. 
During  a  visit  to  a  medical  store  a  potential  buyer  tried  both  devices  on  himself 
repeatedly  during  a  short  period  of  time.  The  following  are  readings  of  systolic 
pressure.

Manufacturer    Competitor
132             129
134             132
129             129
129             138
130             132

a.  Test,  at  the  5%  level  of  significance,  whether  the  data  provide  sufficient 
evidence  to  conclude  that  the  manufacturer’s  claim  is  true. 

b.  Repeat  the  test  at  the  10%  level  of  significance.  Quote  as  many 
computations  from  part  (a)  as  possible. 


LARGE  DATA  SET  EXERCISES 


21. Large Data Sets 1A and 1B record SAT scores for 419 male and 581 female
students. Test, at the 1% level of significance, whether the data provide
sufficient evidence to conclude that the variances of scores of male and female
students differ.

http://www.gone.2012books.lardbucket.org/sites/all/files/data1A.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data1B.xls

22.  Large  Data  Sets  7,  7A,  and  7B  record  the  survival  times  of  140  laboratory  mice 
with  thymic  leukemia.  Test,  at  the  10%  level  of  significance,  whether  the  data 
provide  sufficient  evidence  to  conclude  that  the  variances  of  survival  times  of 
male  mice  and  female  mice  differ. 

http://www.gone.2012books.lardbucket.org/sites/all/files/data7.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7A.xls

http://www.gone.2012books.lardbucket.org/sites/all/files/data7B.xls

ANSWERS


1. a. 11.0, b. 5.06, c. 3.23

3. a. 0.23, b. 0.25, c. 0.40

5. a. 3.14, b. 0.27, c. 3.95, d. 0.21

7. Sample 1: a. $n_1 = 5$, b. $\bar{x}_1 = 3.8$, c. $s_1^2 = 30.2$.
Sample 2: a. $n_2 = 7$, b. $\bar{x}_2 = 0.4286$, c. $s_2^2 = 3.95$

9. a. 1.6563, b. $df_1 = 15$, $df_2 = 20$, c. $F_{0.05} = 2.2$, d. do not reject $H_0$

11. a. 0.5217, b. $df_1 = 9$, $df_2 = 12$, c. $F_{0.95} = 0.3254$, d. do not reject $H_0$

13. a. 0.1692, b. $df_1 = 8$, $df_2 = 30$, c. $F_{0.975} = 0.26$, $F_{0.025} = 2.65$, d. reject $H_0$

15. $F = 0.3793$, $F_{0.90} = 0.58$, reject $H_0$

17. $F = 0.5499$, $F_{0.95} = 0.61$, reject $H_0$

19. $F = 0.0971$, $F_{0.95} = 0.31$, reject $H_0$

21. $F = 0.893131$. $df_1 = 418$ and $df_2 = 580$. Rejection Region:
$(0, 0.7897] \cup [1.2614, \infty)$. Decision: Fail to reject $H_0$ of equal
variances.

11.4 F-Tests in One-Way ANOVA


LEARNING  OBJECTIVE 

1.  To  understand  how  to  use  an  F-test  to  judge  whether  several  population 
means  are  all  equal. 


In Chapter 9 "Two-Sample Problems" we saw how to compare two population
means $\mu_1$ and $\mu_2$. In this section we will learn to compare three or more population
means  at  the  same  time,  which  is  often  of  interest  in  practical  applications.  For 
example,  an  administrator  at  a  university  may  be  interested  in  knowing  whether 
student  grade  point  averages  are  the  same  for  different  majors.  In  another  example, 
an  oncologist  may  be  interested  in  knowing  whether  patients  with  the  same  type  of 
cancer  have  the  same  average  survival  times  under  several  different  competing 
cancer  treatments. 


In general, suppose there are K normal populations with possibly different means,
$\mu_1, \mu_2, \ldots, \mu_K$, but all with the same variance $\sigma^2$. The study question is whether all
the K population means are the same. We formulate this question as the test of
hypotheses


$H_0: \mu_1 = \mu_2 = \cdots = \mu_K$
vs. $H_a$: not all K population means are equal

To  perform  the  test  K  independent  random  samples  are  taken  from  the  K  normal 
populations.  The  K  sample  means,  the  K  sample  variances,  and  the  K  sample  sizes 
are  summarized  in  the  table: 


Population    Sample Size    Sample Mean    Sample Variance
1             $n_1$          $\bar{x}_1$    $s_1^2$
2             $n_2$          $\bar{x}_2$    $s_2^2$
⋮             ⋮              ⋮              ⋮
K             $n_K$          $\bar{x}_K$    $s_K^2$

Define the following quantities:

The combined sample size:

$$n = n_1 + n_2 + \cdots + n_K$$

The mean of the combined sample of all n observations:

$$\bar{x} = \frac{\Sigma x}{n} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \cdots + n_K\bar{x}_K}{n}$$

The mean square for treatment:

$$MST = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \cdots + n_K(\bar{x}_K - \bar{x})^2}{K - 1}$$

The mean square for error:

$$MSE = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_K - 1)s_K^2}{n - K}$$

MST9  can  be  thought  of  as  the  variance  between  the  K  individual  independent 
random  samples  and  MSE10  as  the  variance  within  the  samples.  This  is  the  reason 
for  the  name  “analysis  of  variance,”  universally  abbreviated  ANOVA11.  The 
adjective  “one-way”  has  to  do  with  the  fact  that  the  sampling  scheme  is  the 
simplest possible, that of taking one random sample from each population under
consideration. If the means of the K populations are all the same then the two
quantities MST and MSE should be close to the same, so the null hypothesis will be
rejected  if  the  ratio  of  these  two  quantities  is  significantly  greater  than  1.  This 
yields  the  following  test  statistic  and  methods  and  conditions  for  its  use. 
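One way to see why the ratio is compared with 1 is to look at the expected values of the two mean squares. The following is a standard result, given here only as a sketch under the assumptions of this section (independent random samples and a common variance $\sigma^2$); the symbol $\bar{\mu}$ for the size-weighted average of the population means is introduced just for this illustration.

```latex
\begin{align*}
E[\mathrm{MSE}] &= \sigma^2, \\
E[\mathrm{MST}] &= \sigma^2 + \frac{1}{K-1}\sum_{i=1}^{K} n_i\,(\mu_i - \bar{\mu})^2,
\qquad \bar{\mu} = \frac{1}{n}\sum_{i=1}^{K} n_i\,\mu_i .
\end{align*}
```

When all the $\mu_i$ are equal the sum vanishes and both mean squares estimate $\sigma^2$. Otherwise MST tends to exceed MSE, which is why only large values of the ratio count as evidence against the null hypothesis.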


9.  Mean  square  for  treatment. 

10.  Mean  square  for  error. 

11. Analysis of variance.

Test Statistic for Testing the Null Hypothesis that K
Population Means Are Equal

$$F = \frac{MST}{MSE}$$

If the K populations are normally distributed with a common variance and if
$H_0: \mu_1 = \cdots = \mu_K$ is true then under independent random sampling F
approximately follows an F-distribution with degrees of freedom $df_1 = K - 1$
and $df_2 = n - K$.

The test is right-tailed: $H_0$ is rejected at level of significance $\alpha$ if $F > F_{\alpha}$.


As always the test is performed using the usual five-step procedure.

EXAMPLE 8


The  average  of  grade  point  averages  (GPAs)  of  college  courses  in  a  specific 
major  is  a  measure  of  difficulty  of  the  major.  An  educator  wishes  to  conduct 
a  study  to  find  out  whether  the  difficulty  levels  of  different  majors  are  the 
same.  For  such  a  study,  a  random  sample  of  major  grade  point  averages 
(GPA)  of  11  graduating  seniors  at  a  large  university  is  selected  for  each  of  the 
four  majors  mathematics,  English,  education,  and  biology.  The  data  are 
given  in  Table  11.17  "Difficulty  Levels  of  College  Majors".  Test,  at  the  5% 
level  of  significance,  whether  the  data  contain  sufficient  evidence  to 
conclude  that  there  are  differences  among  the  average  major  GPAs  of  these 
four  majors. 


TABLE 11.17 DIFFICULTY LEVELS OF COLLEGE MAJORS


Mathematics    English    Education    Biology
2.59           3.64       4.00         2.78
3.13           3.19       3.59         3.51
2.97           3.15       2.80         2.65
2.50           3.78       2.39         3.16
2.53           3.03       3.47         2.94
3.29           2.61       3.59         2.32
2.53           3.20       3.74         2.58
3.17           3.30       3.77         3.21
2.70           3.54       3.13         3.23
3.88           3.25       3.00         3.57
2.64           4.00       3.47         3.22

Solution: 


• Step 1. The test of hypotheses is

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$

vs. $H_a$: not all four population means are equal @ $\alpha = 0.05$

• Step 2. The test statistic is $F = MST / MSE$ with (since $n = 44$ and $K = 4$)
degrees of freedom $df_1 = K - 1 = 4 - 1 = 3$ and
$df_2 = n - K = 44 - 4 = 40$.

• Step 3. If we index the population of mathematics majors by 1,
English majors by 2, education majors by 3, and biology majors
by 4, then the sample sizes, sample means, and sample variances
of the four samples in Table 11.17 "Difficulty Levels of College
Majors" are summarized (after rounding for simplicity) by:


Major          Sample Size    Sample Mean           Sample Variance
Mathematics    $n_1 = 11$     $\bar{x}_1 = 2.90$    $s_1^2 = 0.188$
English        $n_2 = 11$     $\bar{x}_2 = 3.34$    $s_2^2 = 0.148$
Education      $n_3 = 11$     $\bar{x}_3 = 3.36$    $s_3^2 = 0.229$
Biology        $n_4 = 11$     $\bar{x}_4 = 3.02$    $s_4^2 = 0.157$

The average of all 44 observations is (after rounding for
simplicity) $\bar{x} = 3.15$. We compute (rounding for simplicity)

$$MST = \frac{11(2.90 - 3.15)^2 + 11(3.34 - 3.15)^2 + 11(3.36 - 3.15)^2 + 11(3.02 - 3.15)^2}{4 - 1} = \frac{1.7556}{3} = 0.585$$

and

$$MSE = \frac{(11 - 1)(0.188) + (11 - 1)(0.148) + (11 - 1)(0.229) + (11 - 1)(0.157)}{44 - 4} = \frac{7.22}{40} = 0.181$$

so that

$$F = \frac{MST}{MSE} = \frac{0.585}{0.181} = 3.232$$

• Step 4. The test is right-tailed. The single critical value is (since
$df_1 = 3$ and $df_2 = 40$) $F_{\alpha} = F_{0.05} = 2.84$. Thus the rejection
region is $[2.84, \infty)$, as illustrated in Figure 11.12.


Figure 11.12 Note 11.36 "Example 8" Rejection Region


• Step 5. Since $F = 3.232 > 2.84$, we reject $H_0$. The data provide
sufficient evidence, at the 5% level of significance, to conclude that the
averages of major GPAs for the four majors considered are not all equal.
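The arithmetic of Step 3 can be checked with a short Python sketch that applies the definitions of MST and MSE to the summary statistics of Example 8. The function name `anova_from_summary` is hypothetical. The code uses the unrounded combined mean (3.155 rather than 3.15) and does not round MST or MSE, so its F differs slightly from the 3.232 obtained above; the decision is unaffected.

```python
def anova_from_summary(sizes, means, variances):
    """One-way ANOVA quantities from per-sample sizes, means, and variances."""
    n = sum(sizes)                      # combined sample size
    k = len(sizes)                      # number of populations
    grand_mean = sum(ni * xi for ni, xi in zip(sizes, means)) / n
    mst = sum(ni * (xi - grand_mean) ** 2
              for ni, xi in zip(sizes, means)) / (k - 1)
    mse = sum((ni - 1) * vi for ni, vi in zip(sizes, variances)) / (n - k)
    return grand_mean, mst, mse, mst / mse


# Summary table of Example 8 (mathematics, English, education, biology).
sizes = [11, 11, 11, 11]
means = [2.90, 3.34, 3.36, 3.02]
variances = [0.188, 0.148, 0.229, 0.157]

grand_mean, mst, mse, f_stat = anova_from_summary(sizes, means, variances)
critical_value = 2.84                   # F_0.05 with df = (3, 40), from the table

print(f"MST = {mst:.3f}, MSE = {mse:.4f}, F = {f_stat:.2f}")
print("reject H0:", f_stat >= critical_value)
```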
EXAMPLE 9


A  research  laboratory  developed  two  treatments  which  are  believed  to  have 
the  potential  of  prolonging  the  survival  times  of  patients  with  an  acute  form 
of  thymic  leukemia.  To  evaluate  the  potential  treatment  effects  33 
laboratory  mice  with  thymic  leukemia  were  randomly  divided  into  three 
groups.  One  group  received  Treatment  1,  one  received  Treatment  2,  and  the 
third  was  observed  as  a  control  group.  The  survival  times  of  these  mice  are 
given in Table 11.18 "Mice Survival Times in Days". Test, at the 1% level of
significance,  whether  these  data  provide  sufficient  evidence  to  confirm  the 
belief  that  at  least  one  of  the  two  treatments  affects  the  average  survival 
time  of  mice  with  thymic  leukemia. 


TABLE 11.18 MICE SURVIVAL TIMES IN DAYS


Treatment 1       Treatment 2    Control
71      75        77             81
72      73        67             79
75      72        79             73
80      65        78             71
60      63        81             75
65      69        72             84
63      64        71             77
78      71        84             67
                  91

Solution: 


•  Step  1.  The  test  of  hypotheses  is 

Ho  :  !*\  =  =  ^3 

vs.  Ha  :  not  all  three  population  means  are  equal  @  a  =  0. 
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•  Step  2.  The  test  statistic  is  F  =  MST  /  MSE  with  (since  n  =  33  and  K 
=  3 )  degrees  of  freedom  df  y  =  K—  1  =  3  —  1  =  2  and 

df2=n-K  =  33  -  3  =  30. 

•  Step  3.  if  we  index  the  population  of  mice  receiving  Treatment  1 
by  1,  Treatment  2  by  2,  and  no  treatment  by  3,  then  the  sample 
sizes,  sample  means,  and  sample  variances  of  the  three  samples 
in  Table  11.18  "Mice  Survival  Times  in  Days"  are  summarized 
(after  rounding  for  simplicity)  by: 


Group 

Sample  Size 

Sample  Mean 

Sample  Variance 

Treatment  1 

n\  —  16 

I,  =  69.75 

5 1  =  34.47 

Treatment  2 

n2  —  9 

J2  =  77.78 

sj  =  52.69 

Control 

OO 

II 

CO 

J3  =  75.88 

^  =  30.69 

The  average  of  all  33  observations  is  (after  rounding  for 
simplicity)  X  =  73.42.  We  compute  (rounding  for  simplicity) 

16(69.75  -  73.42)  2  +  9(77.78  -  73.42)  2  +  8(75.88  -  73 

MST  =  — - - - - - - 

3  -  1 


and 


MSE  = 


(16-  1)  (34.47)  +  (9  -  1)  (52.69)  +  (8 

33-3 


1)  (30.69) 


so  that 


MST 

MSE 


217.50 

38.45 


5.65 


•  Step  4.  The  test  is  right-tailed.  The  single  critical  value  is 
Ea  =  Fo.oi  —  5.39.  Thus  the  rejection  region  is  [5.39,  oo)  ,  as 
illustrated  in  Figure  11,13. 

Figure  11.13 
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Note  11.37  "Example  9" 

Rejection  Region 


• Step 5. Since $F = 5.65 > 5.39$, we reject $H_0$. The data provide sufficient evidence, at the 1% level of significance, to conclude that a treatment effect exists for at least one of the two treatments in increasing the mean survival time of mice with thymic leukemia.
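
The arithmetic in Steps 2 through 5 can be reproduced with a short script. What follows is a minimal Python sketch, not part of the original text: the function name `anova_f_from_summary` is made up for illustration, the inputs are the rounded summary values from the table in Step 3, and the critical value 5.39 is copied from Step 4 rather than computed. The grand mean is recomputed from the group means, so the results can differ slightly in the last digit from the printed ones.

```python
# Sketch: one-way ANOVA F statistic from group summary statistics.
# All numbers are the rounded values printed in the worked example.

def anova_f_from_summary(sizes, means, variances):
    """Return (MST, MSE, F, df1, df2) for K independent samples."""
    n = sum(sizes)                      # combined sample size
    k = len(sizes)                      # number of populations K
    # Weighted average of the group means = mean of all observations.
    grand_mean = sum(n_i * m_i for n_i, m_i in zip(sizes, means)) / n
    # Between-sample variation (numerator of MST).
    sst = sum(n_i * (m_i - grand_mean) ** 2 for n_i, m_i in zip(sizes, means))
    # Within-sample variation (numerator of MSE).
    sse = sum((n_i - 1) * v_i for n_i, v_i in zip(sizes, variances))
    df1, df2 = k - 1, n - k
    mst, mse = sst / df1, sse / df2
    return mst, mse, mst / mse, df1, df2


sizes = [16, 9, 8]                  # Treatment 1, Treatment 2, Control
means = [69.75, 77.78, 75.88]
variances = [34.47, 52.69, 30.69]

mst, mse, f_stat, df1, df2 = anova_f_from_summary(sizes, means, variances)
critical_value = 5.39               # F_0.01 with df1 = 2, df2 = 30, from Step 4

print(f"df1 = {df1}, df2 = {df2}")
print(f"MST = {mst:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
print("reject H0" if f_stat >= critical_value else "do not reject H0")
```

With these rounded inputs the sketch gives MST of about 217.50, MSE of about 38.45, and F of about 5.66. The printed value 5.65 presumably reflects rounding at a different stage, and the decision to reject $H_0$ is the same either way.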


It is important to note that, if the null hypothesis of equal population means is rejected, the statistical implication is that not all population means are equal. It does not, however, tell which population mean is different from which. The inference about where the suggested difference lies is most frequently made by a follow-up study.


KEY  TAKEAWAY 


• An F-test can be used to evaluate the hypothesis that the means of several normal populations, all with the same standard deviation, are identical.

1. The following three random samples are taken from three normal populations with respective means $\mu_1$, $\mu_2$, and $\mu_3$, and the same variance $\sigma^2$.


Sample 1    Sample 2    Sample 3
2           3           0
2           5           1
3           7           2
5                       1
3

a.  Find  the  combined  sample  size  n. 

b. Find the combined sample mean $\bar{x}$.

c.  Find  the  sample  mean  for  each  of  the  three  samples. 

d.  Find  the  sample  variance  for  each  of  the  three  samples. 

e.  Find  MST. 

f.  Find  MSE. 

g.  Find  F  =  MST  /  MSE. 

2. The following three random samples are taken from three normal populations with respective means $\mu_1$, $\mu_2$, and $\mu_3$, and the same variance $\sigma^2$.


Sample  1 

Sample  2 

Sample  3 

0.0 

1.3 

0.2 

0.1 

1.5 

0.2 

0.2 

1.7 

0.3 

0.1 

0.5 

0.0 

a.  Find  the  combined  sample  size  n. 

b. Find the combined sample mean $\bar{x}$.

c.  Find  the  sample  mean  for  each  of  the  three  samples. 

d.  Find  the  sample  variance  for  each  of  the  three  samples. 

e. Find MST.

f. Find MSE.

g. Find F = MST / MSE.

3. Refer to Exercise 1.

a. Find the number of populations under consideration K.

b. Find the degrees of freedom $df_1 = K - 1$ and $df_2 = n - K$.

c. For $\alpha = 0.05$, find $F_\alpha$ with the degrees of freedom computed above.

d. At $\alpha = 0.05$, test hypotheses

$H_0 : \mu_1 = \mu_2 = \mu_3$

vs. $H_a$ : at least one pair of the population means are not equal

4. Refer to Exercise 2.

a. Find the number of populations under consideration K.

b. Find the degrees of freedom $df_1 = K - 1$ and $df_2 = n - K$.

c. For $\alpha = 0.01$, find $F_\alpha$ with the degrees of freedom computed above.

d. At $\alpha = 0.01$, test hypotheses

$H_0 : \mu_1 = \mu_2 = \mu_3$

vs. $H_a$ : at least one pair of the population means are not equal
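
Exercises 3 and 4 ask for $F_\alpha$ from the table of upper critical values. When there are three populations, so that $df_1 = K - 1 = 2$, the right-tail area of the F-distribution has a simple closed form, $P(F > f) = (1 + 2f/df_2)^{-df_2/2}$, which can serve as a cross-check on a table lookup. The sketch below is an illustration only, written in Python with made-up function names. It applies only when $df_1 = 2$ and is not the method the text uses; for any other number of populations the table or statistical software is needed.

```python
# Sketch: right-tail area and critical value of an F-distribution
# whose numerator degrees of freedom equal 2 (three populations).

def f_upper_tail_df1_2(f_value, df2):
    """P(F > f_value) for an F-distribution with 2 and df2 degrees of freedom."""
    return (1.0 + 2.0 * f_value / df2) ** (-df2 / 2.0)


def f_critical_df1_2(alpha, df2):
    """Critical value F_alpha cutting off a right tail of area alpha (df1 = 2 only)."""
    # Obtained by solving (1 + 2f/df2)^(-df2/2) = alpha for f.
    return (df2 / 2.0) * (alpha ** (-2.0 / df2) - 1.0)


# Cross-check against the worked example: alpha = 0.01, df2 = 30.
f_crit = f_critical_df1_2(0.01, 30)
print(f"F_0.01 with df1 = 2, df2 = 30: {f_crit:.2f}")
print(f"tail area at that value: {f_upper_tail_df1_2(f_crit, 30):.4f}")
```

For $\alpha = 0.01$ and $df_2 = 30$ this returns about 5.39, the critical value used in Step 4 of the worked example.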


APPLICATIONS 


5.  The  Mozart  effect  refers  to  a  boost  of  average  performance  on  tests  for 
elementary  school  students  if  the  students  listen  to  Mozart’s  chamber  music 
for  a  period  of  time  immediately  before  the  test.  In  order  to  attempt  to  test 
whether  the  Mozart  effect  actually  exists,  an  elementary  school  teacher 
conducted  an  experiment  by  dividing  her  third-grade  class  of  15  students  into 
three  groups  of  5.  The  first  group  was  given  an  end-of-grade  test  without 
music;  the  second  group  listened  to  Mozart’s  chamber  music  for  10  minutes; 
and the third group listened to Mozart’s chamber music for 20 minutes before
the  test.  The  scores  of  the  15  students  are  given  below: 


Group 1    Group 2    Group 3
80         79         73
63         73         82
74         74         79
71         77         82
70         81         84

Using the ANOVA F-test¹² at $\alpha = 0.10$, is there sufficient evidence in the data to suggest that the Mozart effect exists?

6.  The  Mozart  effect  refers  to  a  boost  of  average  performance  on  tests  for 
elementary  school  students  if  the  students  listen  to  Mozart’s  chamber  music 
for  a  period  of  time  immediately  before  the  test.  Many  educators  believe  that 
such  an  effect  is  not  necessarily  due  to  Mozart’s  music  per  se  but  rather  a 
relaxation  period  before  the  test.  To  support  this  belief,  an  elementary  school 
teacher  conducted  an  experiment  by  dividing  her  third-grade  class  of  15 
students  into  three  groups  of  5.  Students  in  the  first  group  were  asked  to  give 
themselves  a  self-administered  facial  massage;  students  in  the  second  group 
listened  to  Mozart’s  chamber  music  for  15  minutes;  students  in  the  third  group 
listened  to  Schubert’s  chamber  music  for  15  minutes  before  the  test.  The 
scores  of  the  15  students  are  given  below: 


Group 1    Group 2    Group 3
79         82         80
81         84         81
80         86         71
89         91         90
86         82         86

Test, using the ANOVA F-test at the 10% level of significance, whether the data provide sufficient evidence to conclude that any of the three relaxation methods does better than the others.


12. A test based on an F statistic to check whether several population means are equal.


7.  Precision  weighing  devices  are  sensitive  to  environmental  conditions. 
Temperature  and  humidity  in  a  laboratory  room  where  such  a  device  is 
installed  are  tightly  controlled  to  ensure  high  precision  in  weighing.  A  newly 
designed  weighing  device  is  claimed  to  be  more  robust  against  small  variations 
of  temperature  and  humidity.  To  verify  such  a  claim,  a  laboratory  tests  the 
new  device  under  four  settings  of  temperature-humidity  conditions.  First,  two 
levels  of  high  and  low  temperature  and  two  levels  of  high  and  low  humidity  are 
identified.  Let  T  stand  for  temperature  and  H  for  humidity.  The  four 
experimental settings are defined and noted as (T, H): (high, high), (high, low), (low, high), and (low, low). A pre-calibrated standard weight of 1 kg was weighed by the new device four times in each setting. The results in terms of error (in micrograms, mcg) are given below:


(high, high)    (high, low)    (low, high)    (low, low)
-1.50           11.47          -14.29         5.54
-6.73           9.28           -18.11         10.34
11.69           5.58           -11.16         15.23
-5.72           10.80          -10.41         -5.69

Test,  using  the  ANOVA  F-test  at  the  1%  level  of  significance,  whether  the  data 
provide  sufficient  evidence  to  conclude  that  the  mean  weight  readings  by  the 
newly  designed  device  vary  among  the  four  settings. 

8. To investigate the real cost of owning different makes and models of new automobiles, a consumer protection agency followed 16 owners of new vehicles of four popular makes and models, call them TC, HA, NA, and FT, and kept a record of each owner’s real cost in dollars for the first five years. The five-year costs of the 16 car owners are given below:


TC      HA      NA      FT
8423    7776    8907    10333
7889    7211    9077    9217
8665    6870    8732    10540
7129    9747    7359    8677

Test,  using  the  ANOVA  F-test  at  the  5%  level  of  significance,  whether  the  data 
provide  sufficient  evidence  to  conclude  that  there  are  differences  among  the 
mean  real  costs  of  ownership  for  these  four  models. 

9. Helping people to lose weight has become a huge industry in the United States, with annual revenue in the hundreds of billions of dollars. Recently each of the three market-leading weight-reducing programs claimed to be the most effective. A consumer research company recruited 33 people who wished to lose weight and sent them to the three leading programs. After six months their weight losses were recorded. The results are summarized below:


Statistic          Prog. 1                Prog. 2               Prog. 3
Sample Mean        $\bar{x}_1 = 10.65$    $\bar{x}_2 = 8.90$    $\bar{x}_3 = 9.33$
Sample Variance    $s_1^2 = 27.20$        $s_2^2 = 16.86$       $s_3^2 = 32.40$
Sample Size        $n_1 = 11$             $n_2 = 11$            $n_3 = 11$

The mean weight loss of the combined sample of all 33 people was $\bar{x} = 9.63$.
Test,  using  the  ANOVA  F-test  at  the  5%  level  of  significance,  whether  the  data 
provide  sufficient  evidence  to  conclude  that  some  program  is  more  effective 
than  the  others. 

10.  A  leading  pharmaceutical  company  in  the  disposable  contact  lenses  market  has 
always  taken  for  granted  that  the  sales  of  certain  peripheral  products  such  as 
contact  lens  solutions  would  automatically  go  with  the  established  brands.  The 
long-standing  culture  in  the  company  has  been  that  lens  solutions  would  not 
make  a  significant  difference  in  user  experience.  Recent  market  research 
surveys,  however,  suggest  otherwise.  To  gain  a  better  understanding  of  the 
effects  of  contact  lens  solutions  on  user  experience,  the  company  conducted  a 
comparative  study  in  which  63  contact  lens  users  were  randomly  divided  into 
three  groups,  each  of  which  received  one  of  three  top  selling  lens  solutions  on 
the  market,  including  one  of  the  company’s  own.  After  using  the  assigned 
solution  for  two  weeks,  each  participant  was  asked  to  rate  the  solution  on  the 
scale  of  1  to  5  for  satisfaction,  with  5  being  the  highest  level  of  satisfaction.  The 
results  of  the  study  are  summarized  below: 


Statistics         Sol. 1                Sol. 2                Sol. 3
Sample Mean        $\bar{x}_1 = 3.28$    $\bar{x}_2 = 3.96$    $\bar{x}_3 = 4.10$
Sample Variance    $s_1^2 = 0.15$        $s_2^2 = 0.32$        $s_3^2 = 0.36$
Sample Size        $n_1 = 18$            $n_2 = 23$            $n_3 = 22$

The mean satisfaction level of the combined sample of all 63 participants was
$\bar{x} = 3.81$. Test, using the ANOVA F-test at the 5% level of significance,
whether  the  data  provide  sufficient  evidence  to  conclude  that  not  all  three 
average  satisfaction  levels  are  the  same. 
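
Exercises 5 through 8 list raw observations rather than summary statistics, so the sample means and sample variances have to be computed before MST and MSE. A minimal Python sketch of that workflow follows, using the Exercise 5 scores as input. The function name `one_way_anova` and the dictionary keys are assumptions made for illustration, and the critical value still has to come from the table of upper critical values of F-distributions.

```python
from statistics import mean, variance


def one_way_anova(samples):
    """Compute MST, MSE and F for a list of samples (each a list of numbers)."""
    sizes = [len(s) for s in samples]
    n = sum(sizes)                                   # combined sample size
    k = len(samples)                                 # number of populations K
    grand_mean = sum(sum(s) for s in samples) / n    # mean of all observations
    means = [mean(s) for s in samples]
    variances = [variance(s) for s in samples]       # divides by n_i - 1
    mst = sum(n_i * (m - grand_mean) ** 2 for n_i, m in zip(sizes, means)) / (k - 1)
    mse = sum((n_i - 1) * v for n_i, v in zip(sizes, variances)) / (n - k)
    return {"df1": k - 1, "df2": n - k, "MST": mst, "MSE": mse, "F": mst / mse}


# Exercise 5 scores: no music, 10 minutes of Mozart, 20 minutes of Mozart.
group_1 = [80, 63, 74, 71, 70]
group_2 = [79, 73, 74, 77, 81]
group_3 = [73, 82, 79, 82, 84]

result = one_way_anova([group_1, group_2, group_3])
for name, value in result.items():
    print(f"{name} = {value:.4f}" if isinstance(value, float) else f"{name} = {value}")
```

The F value produced for these scores agrees with the one listed for Exercise 5 in the Answers (3.9647). The same function accepts samples of unequal sizes and any number of groups.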


LARGE  DATA  SET  EXERCISE 


11. Large Data Set 9 records the costs of materials (textbook, solution manual, laboratory fees, and so on) in each of ten different courses in each of three different subjects: chemistry, computer science, and mathematics. Test, at the 1% level of significance, whether the data provide sufficient evidence to conclude that the mean costs in the three disciplines are not all the same.

http://www.gone.2012books.lardbucket.org/sites/all/files/data9.xls


ANSWERS 

1. a. $n = 12$,

b. $\bar{x} = 2.8333$,

c. $\bar{x}_1 = 3$, $\bar{x}_2 = 5$, $\bar{x}_3 = 1$,

d. $s_1^2 = 1.5$, $s_2^2 = 4$, $s_3^2 = 0.6667$,

e. $MST = 13.83$,

f. $MSE = 1.78$,

g. $F = 7.7812$

3. a. $K = 3$;

b. $df_1 = 2$, $df_2 = 9$;

c. $F_{0.05} = 4.26$;

d. $F = 7.78$, reject $H_0$

5. $F = 3.9647$, $F_{0.10} = 2.81$, reject $H_0$

7. $F = 9.6018$, $F_{0.01} = 5.95$, reject $H_0$

9. $F = 0.3589$, $F_{0.05} = 3.32$, do not reject $H_0$

11. $F = 1.418$. $df_1 = 2$ and $df_2 = 27$. Rejection Region: $[5.4881, \infty)$.

Decision: Fail to reject $H_0$ of equal means.

Figure 12.1 Cumulative Binomial Probability

Cumulative Binomial Probability $P(X \le x)$

Figure 12.2 Cumulative Normal Probability

Cumulative Probability $P(Z < z)$

Figure 12.3 Critical Values of t

Figure 12.4 Critical Values of Chi-Square Distributions

Figure 12.5 Upper Critical Values of F-Distributions

Figure 12.6 Lower Critical Values of F-Distributions